PLS206 Spring 2007
Homework 1 – Answer Key
This file contains two worksheets: these Instructions and Simulation. The goals of this exercise are to:
Learn what a parametric simulation is.
Learn the concept that the sample and the estimated model are not the same as the "real" model.
Understand that residuals (e) are not the same as true errors (epsilon).
Develop an intuitive feel for the effect of true parameters values and hypothesized model on the
estimated parameters and their estimated variances.
Describe the effects of changes in beta0, beta1, beta2 and sigma on the shape of the true model. You
can use hand-drawn pictures for the description.
beta0 adjusts the intercept, i.e., where the true model crosses the Y axis. beta1 is the
coefficient for the linear portion of the model and controls the slope of its linear contribution.
beta2 is the coefficient for the non-linear (quadratic) portion of the model and controls the
curvature of the graph, i.e., the width of the parabola. However, if beta1 is much greater than
beta2 and much greater than the values of x, then the shape of the model becomes nearly
linear over the range of x's we are interested in. See figure for further details. Sigma controls
the spread of the points around the model, but not the shape of the model.
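These effects can be sketched numerically instead of with hand-drawn pictures. The parameter values and X range below are arbitrary choices for illustration, not the worksheet's actual values:

```python
import numpy as np

def true_model(x, beta0, beta1, beta2):
    # Deterministic part of the model: E{Y} = beta0 + beta1*x + beta2*x^2
    return beta0 + beta1 * x + beta2 * x**2

x = np.arange(0.0, 21.0)  # hypothetical range of X values

# beta0 shifts the whole curve up or down by a constant (the intercept).
base = true_model(x, 0.0, 2.0, -0.05)
shifted = true_model(x, 10.0, 2.0, -0.05)

# With beta2 = 0 the quadratic term vanishes and the curve is a straight
# line, so its second differences are all zero.
linear = true_model(x, 10.0, 2.0, 0.0)

# sigma adds scatter around the curve without changing its shape.
rng = np.random.default_rng(1)
observed = shifted + rng.normal(0.0, 3.0, size=x.size)
```

Plotting `base`, `shifted`, `linear`, and `observed` against `x` reproduces the four effects: an intercept shift, a straight line, and scatter of constant shape.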
List all columns and/or cells that represent random variables and indicate what symbols are usually
used to designate them. If you have any edition of the recommended textbook, use the symbols in the
textbook. Ignore columns G and H.
All random variables are shaded in light blue in the Simulation worksheet. Any cell that changes when
you recalculate the worksheet is a random variable, because it is ultimately dependent on the
randomness of the epsilon column.
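The chain of dependence on epsilon, and the distinction between residuals (e) and true errors (epsilon), can be illustrated with a small simulation. The parameter values and design points here are hypothetical stand-ins for the worksheet's:

```python
import numpy as np

rng = np.random.default_rng(11)
beta0, beta1, sigma = 5.0, 2.0, 4.0        # hypothetical true parameters
x = np.arange(1.0, 21.0)                   # hypothetical design points

eps = rng.normal(0.0, sigma, size=x.size)  # true errors: random, unobservable
y = beta0 + beta1 * x + eps                # observed Y depends on eps, so it
                                           # is a random variable too
b1, b0 = np.polyfit(x, y, 1)               # estimated model (also random)
e = y - (b0 + b1 * x)                      # residuals: computed from the fit

# e approximates eps but never equals it: OLS forces the residuals to sum
# to zero, while the true errors carry no such constraint.
```

Recalculating (re-drawing `eps`) changes every downstream quantity, which is exactly why those cells count as random variables.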
Set beta2 to 0 and keep beta0 and beta1 constant. Perform 20 simulations by pressing the key
combination indicated. For each simulation, make predictions for the yield expected for X=9 and X=18.
Make a table with columns for predicted yield for X=9, predicted yield for X=18, first individual Y
value observed for X=9 (cell B15), and first individual Y value observed for X=18 (cell B24). This table
will have 20 rows.
The answer here should be a table with 4 columns and 20 rows. Each column should represent:
values for Y9, Yhat9, Y18, and Yhat18, for each of twenty replications of the experiment. The values
should vary randomly around some mean value. The variance of the Yhat's should be lower than the
variance of the Y's – though this will not necessarily be evident in your set of repetitions. The
variance of the Yhat's should become greater at X values farther from Xbar.
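The pattern can be reproduced with a quick simulation. The true parameter values and the design X = 1..20 are hypothetical, and many more than 20 repetitions are run so the pattern is unmistakable:

```python
import numpy as np

rng = np.random.default_rng(42)
beta0, beta1, sigma = 5.0, 2.0, 4.0   # hypothetical true values; beta2 = 0
x = np.arange(1.0, 21.0)              # hypothetical design points
reps = 2000                           # far more than 20, to make the pattern clear

y9, yhat9, y18, yhat18 = [], [], [], []
for _ in range(reps):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1, b0 = np.polyfit(x, y, 1)      # refit the model for each "experiment"
    y9.append(y[8])                   # observed Y at X=9
    y18.append(y[17])                 # observed Y at X=18
    yhat9.append(b0 + b1 * 9.0)       # predicted yield at X=9
    yhat18.append(b0 + b1 * 18.0)     # predicted yield at X=18

# var(Yhat) < var(Y) at both X's, and var(Yhat18) > var(Yhat9) because
# X=18 lies farther from the mean of the X's (10.5) than X=9 does.
```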
Calculate the variance of each of the columns above. Discuss how they relate to the equations to
determine the variance of predictions of expected values of Y and individual values of Y.
The answer here depends on the actual numbers obtained in question 3. The variance can be
obtained for Yhati and Yi by calculating the sum of the squared deviations around the sample
mean (using the 20 individual values as a sample) divided by the degrees of freedom,
Σ(Yi − Ybar)² / (n − 1). The results should be similar to those obtained when the variance of
these two variables is calculated using the equations for the estimated variance of a
prediction of the expected value of yield given Xh, and of an individual yield given Xh. Those
equations calculate the variance of these variables analytically using one repetition of the
experiment.
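The two analytic equations can be sketched as a helper function for simple linear regression; the data below are simulated with hypothetical parameter values purely for illustration:

```python
import numpy as np

def prediction_variances(x, y, xh):
    """From a single sample: MSE, s^2 for the mean response at xh, and s^2
    for predicting one new individual observation at xh."""
    n = x.size
    b1, b0 = np.polyfit(x, y, 1)
    mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)        # MSE = SSE / (n - 2)
    sxx = np.sum((x - x.mean())**2)
    s2_mean = mse * (1.0 / n + (xh - x.mean())**2 / sxx)  # s^2{Yhat_h}
    s2_indiv = mse * (1.0 + 1.0 / n + (xh - x.mean())**2 / sxx)  # s^2{pred}
    return mse, s2_mean, s2_indiv

rng = np.random.default_rng(3)
x = np.arange(1.0, 21.0)
y = 5.0 + 2.0 * x + rng.normal(0.0, 4.0, size=x.size)
mse, s2_mean, s2_indiv = prediction_variances(x, y, 9.0)
# s2_indiv exceeds s2_mean by exactly MSE: the extra sigma^2 term reflects
# the scatter of an individual observation around its mean.
```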
Assume that each simulation represents a real experiment. Describe two ways to create a confidence
interval for beta1 and discuss the advantages and disadvantages of each one. Note that each
simulation run can yield estimates of beta1 and its CI.
There are at least two ways to go about calculating a confidence interval around beta1:
(1) As with the table above, a table of beta1 values can be simulated by repeating the experiment
with the same parameter values 20 or more times, using Shift-F9. The mean and variance of beta1
can then be estimated from the sample of b1's generated by the repetitions, using the simple
equations Σ(b1i)/n and Σ(b1i − b1bar)² / (n − 1), respectively, for estimating beta1 and for
estimating the variance of beta1. OR (2) the variance of b1 can be estimated using the equation
for s²{b1} given in the lectures, from a single rep of the experiment and a single sample of
Yi's, Ybar, Yhat's, and X values, together with the "slope" estimate given in cell E31.
Either way, the variance obtained can be used to calculate a confidence interval around the
estimated beta1 using a t distribution: b1 ± s{b1} * t(df, α/2). The estimate for b1 is either the
average of the b1's from method (1) or the b1 ("slope") calculated in method (2).

Symbols used in the Simulation worksheet:
Yij        Observed yield or dependent variable
Ybar       Average of observed yields
Ybar.j     Average of observed yield for level j of X
Yhat       Estimated yield
E{Yij|Xj}  Mean or expected value of yield for level j of predictor X
eij        Residual for observation i of level j of X
εij        Actual random error for observation i of level j of X
β̂0, b0     Estimated intercept
β̂1, b1     Estimated slope or regression coefficient for predictor X
σ̂², MSE    Estimated variance of the error
SSR        Sum of squares of the model or regression; explained sum of squares
SSE        Sum of squares of the error or residuals; unexplained sum of squares
SST        Total sum of squares of Y
Note that when using the first method, the dfe=n-1. In the second method dfe=n-2. Thus, the df used
to look up the t-value are different.
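Both methods can be sketched side by side. The true parameter values and the design X = 1..20 are hypothetical, and a loop stands in for pressing Shift-F9; both variance estimates should approximate the true sampling variance σ²/Sxx:

```python
import numpy as np

rng = np.random.default_rng(7)
beta0, beta1, sigma = 5.0, 2.0, 4.0   # hypothetical true parameters
x = np.arange(1.0, 21.0)              # hypothetical design points
sxx = np.sum((x - x.mean())**2)

# Method 1: repeat the experiment many times; estimate var{b1} from the
# sample of b1's (df = reps - 1).
b1_samples = []
for _ in range(500):
    y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
    b1_samples.append(np.polyfit(x, y, 1)[0])
b1_samples = np.asarray(b1_samples)
var_b1_sim = b1_samples.var(ddof=1)

# Method 2: one rep plus the analytic formula s^2{b1} = MSE / Sxx
# (df = n - 2).
y = beta0 + beta1 * x + rng.normal(0.0, sigma, size=x.size)
b1_hat, b0_hat = np.polyfit(x, y, 1)
mse = np.sum((y - (b0_hat + b1_hat * x))**2) / (x.size - 2)
var_b1_analytic = mse / sxx

# Either estimate of var{b1} goes into b1 ± s{b1} * t(df, alpha/2),
# with the appropriate df for the t lookup.
```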
Answer problems 1.2, 1.5, 1.7, 1.12 and 1.16 on page 33 of the textbook.
1.2 Equation: Y=$300 + $2*X. This is characterized best as a “functional” not a “statistical”
relationship because in this example, there is no uncertainty associated with Y. Y IS
EXACTLY equal to the sum of the two terms: $300 (intercept) and $2*X (slope*X). There
would be no scatter around the line you plotted from this equation. If there were some scatter
around the line, or in other words, if there was some variation in Y not explained by the model,
then this might be better characterized as a statistical relationship.
1.5 The equation written in this question is incorrect. The value “E{Yi},” is the “expected value for
Yi” (when X is in the ith level, or when X=Xi). This value can be calculated as the mean of
the probability distribution for all Yi’s at that level of X, which is given by the equation Beta0 +
Beta1Xi (or the line – remember the picture depicting a regression line with normal
distributions centered around the line representing the possible values for Y at various levels
of X). Another way of looking at it is this: since the error terms are assumed to be normally
distributed, with mean 0 and variance σ2, then the expected value for εi at any given Xi is
E{εi}=0 (the expected value is given by the mean of the probability distribution of the random
variable ε). If E{εi} is 0, then plugging this information into the familiar equation:
Yi = Beta0 + Beta1Xi + εi,
we get…
E{Yi} = Beta0 + Beta1Xi + 0
OR…
E{Yi}=Beta0 + Beta1Xi.
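The identity E{Yi} = Beta0 + Beta1*Xi can be checked by simulation. The parameter values below are borrowed from problem 1.7 (Beta0 = 100, Beta1 = 20, σ = 5, X = 5) and are otherwise arbitrary:

```python
import random

random.seed(3)
beta0, beta1, sigma, xi = 100.0, 20.0, 5.0, 5.0
n = 200_000

# Average many draws of Yi = beta0 + beta1*xi + eps. Because E{eps} = 0,
# the average converges to beta0 + beta1*xi = 200.
mean_y = sum(beta0 + beta1 * xi + random.gauss(0.0, sigma)
             for _ in range(n)) / n
```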
1.7
(a) Regression model 1.1 is Yi = Beta0 + Beta1Xi + εi. In this example, we are given Beta0,
Beta1, and σ2 for the error term (the error term has by definition mean 0 and variance σ2),
which represent the TRUE parameters for the population. (Note: this is what makes this a
simulation experiment – in real experiments we NEVER will know the true parameters, only
better and better estimates of them the more data we collect – unless we collected data for
every single member of the population at once with perfect precision). Since we’ve been
given the actual parameter values, but not the probability distribution associated with ε, we
CANNOT calculate the exact probability that Y will fall between 195 and 205.
(b) Given a normally distributed error, we CAN state the exact probability that Y will fall
between 195 and 205 for X=5. Note first that E{Yi} = 200 (calculated from 100 + 20*5),
which is the mean of this probability distribution, so we know the mean value for Yi at X=5
exactly. But there is some variance around this mean in the true population, denoted by
σ²{Yi}; with σ = 5, this variance describes the width of the normal distribution for Y's
centered around 200.
The probability that Y is between 195 and 205 when X=5 is therefore the same as the
probability that z falls between (195−200)/5 = −1 and (205−200)/5 = +1, which is about 0.68.
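The arithmetic can be checked with the standard normal CDF, written here via the error function so no statistics library is needed:

```python
from math import erf, sqrt

def normal_cdf(z):
    # Standard normal CDF expressed via the error function
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

mu = 100.0 + 20.0 * 5.0            # E{Y} = 200 at X = 5
sigma_y = 5.0                      # sigma{Y} = 5
z_lo = (195.0 - mu) / sigma_y      # -1
z_hi = (205.0 - mu) / sigma_y      # +1
prob = normal_cdf(z_hi) - normal_cdf(z_lo)   # P(-1 < z < 1), about 0.6827
```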
1.12(a) This was an observational study. The researcher did not manipulate any experimental
conditions in a controlled way (except to select participants).
(b) “Correlation does not necessarily imply causation.” Thus, the correlation detected by the
study may or may not be causal. The conclusion (as a factual statement) may or may not
be correct, but the method used to reach it is flawed.
(c) This part of the question is asking you to think about “confounding” variables – those
factors that vary with both X and Y and may influence the relationship between X and Y.
Some examples include: (1) people who exercise more may lead generally more
healthful lifestyles (e.g. eat better, watch their weight, take vitamins, etc.) which may in
turn affect their health status; (2) People who are less ill may feel better in general and
may tend to want to exercise more (i.e., which is the dependent vs. independent variable?).
(d) (Note: this is a tricky question – definitively showing cause and effect may require much
more information than a single study can provide, whether observational or experimental.
However, for the purposes of the question, one can suspend disbelief and describe some
ways to minimize the possibility that the relationship between X and Y is caused by some
unmeasured variable). This part of the question is asking about how you might minimize
the probability that confounding variables are influencing the results, thus isolating as
much as possible, the effects of X on Y. This is achieved, in theory, by control and
randomization. For example, a researcher could design an experiment by selecting two
groups who all start out healthy at the beginning of the study: the experimental group
(those that have a higher level of exercise) and the control group (those with a low level
of exercise). The experimental and control groups would be selected such that they are
as equivalent as possible in all other ways besides exercise level AND/OR participants
would be randomized such that the effects of unmeasured confounding variables are
likely to be evened out among the groups. Then the researcher would follow the groups
over time and measure the level of colds among each group. (Other ways to minimize
potential bias in the results can be noted, but are beyond the scope of the question. For
example, the researcher could design the study so that (1) people are assigned blindly to
groups – not really possible here since people know how much they are exercising; and
(2) the researcher measuring the effect could be blinded, e.g. kept unaware about the
group assignment of each participant.) There are other possible and equally valid
experimental designs that could be used – the key is that control and randomization will
hopefully either control for or even out the effects of unmeasured confounding variables
among groups.
1.16 The least squares method, by itself, is robust against deviations from normality and doesn't
assume any particular distribution. However, by convention, most linear models such as
linear regression and ANOVA assume a normal distribution of errors and Y's. It is important
to understand where this assumption of normality comes into play: it is not in the least squares
regression itself, but in the interpretation and hypothesis testing that come later. For
example, when one gets a “p value” in the various outputs from a statistical program such as
JMP (or from hand calculations of p values), the value is determined by comparing a test statistic to
a distribution - usually a "t" distribution or an "F" distribution (which one is used depends on
the specific hypothesis or question). THIS STEP, selecting a probability distribution and
making inferences about the data in relation to the population being estimated, using these
distributions, is the step that is highly dependent on the assumption of normality in the errors
and Y's. Without the assumption of normality, the inference about the system using these
distributions would be invalid.