Midterm Key - Marriott School

Definitions
1. Define sampling distribution and explain how it relates to the standard error of the mean. (2
Points)
Sampling distribution: The probability distribution that describes how a statistic, such as the
mean, varies from sample to sample.
The location and scale of the sampling distribution of $\bar{X}$ are given by $\bar{X}$ and the standard
error of $\bar{X}$, which is
$s_{\bar{X}} = \dfrac{s}{\sqrt{n}}$
This suggests that the standard error of the mean is a measurement of the variation in the mean
from sample to sample.
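As a minimal sketch of the formula above, the following Python snippet computes the standard error of the mean for a small sample. The data values and the function name are invented for illustration and are not part of the exam.

```python
import math
import statistics


def standard_error_of_mean(sample):
    """Return s / sqrt(n), the estimated standard error of the sample mean."""
    n = len(sample)
    s = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(n)


# Hypothetical data, not taken from the exam
sample = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
se = standard_error_of_mean(sample)
print(f"mean = {statistics.mean(sample):.3f}, standard error = {se:.3f}")
```

A larger n shrinks the standard error, which is why means of large samples vary less from sample to sample.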
2. Compare and contrast two-sample comparisons with paired comparisons. (3 Points)
Two-sample comparison: A test for the difference between the means of two different samples.
No relationship exists between the observations in the two samples. The samples are
independent of each other.
Paired comparison: A comparison of two treatments using dependent samples designed to be
similar. Observations in each sample naturally associate with each other. This often occurs by
exposing the same experimental unit to two different treatments.
Both the two sample and paired comparisons are used to measure the difference between two
groups. Pairing isolates the effect of the treatment, reducing the random variation that can hide
a difference.
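The effect of pairing on random variation can be sketched numerically. The snippet below uses hypothetical measurements on the same five units under two treatments (all values are invented) and compares the standard error obtained by treating the samples as independent with the one obtained from the paired differences.

```python
import math
import statistics

# Hypothetical measurements on the same five units under two treatments
treatment_a = [10.0, 12.0, 9.0, 15.0, 11.0]
treatment_b = [11.0, 13.5, 10.0, 16.5, 12.0]
n = len(treatment_a)

# Two-sample view: treat the groups as independent (unpooled standard error)
mean_diff = statistics.mean(treatment_b) - statistics.mean(treatment_a)
se_two_sample = math.sqrt(
    statistics.variance(treatment_a) / n + statistics.variance(treatment_b) / n
)

# Paired view: analyze the within-unit differences
diffs = [b - a for a, b in zip(treatment_a, treatment_b)]
se_paired = statistics.stdev(diffs) / math.sqrt(n)

t_two_sample = mean_diff / se_two_sample
t_paired = statistics.mean(diffs) / se_paired

print(f"two-sample: se = {se_two_sample:.3f}, t = {t_two_sample:.2f}")
print(f"paired:     se = {se_paired:.3f}, t = {t_paired:.2f}")
```

Both views estimate the same mean difference, but in this made-up data the paired standard error is much smaller because the unit-to-unit variation cancels in the differences.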
3. Define the Tukey Bulging Rule and explain how it can help determine appropriate
transformations. (3 Points)
Tukey’s bulging rule uses the accompanying diagram to suggest appropriate transformations
whenever we encounter a nonlinear pattern. This process is usually iterative as we match the
pattern that is found in our data with the shape found in each of the four different quadrants in
the diagram. After matching the pattern in our data with one of the shapes in the four different
quadrants, we can then go up or down the ladder of powers for the x-axis and/or y-axis in order
to transform the nonlinear pattern into a straight line.
[Diagram: Tukey's bulging rule, one curve shape per quadrant. Axis labels: Y up: $Y^2$, $Y^3$; Y down: $\sqrt{Y}$, $\log(Y)$; X up: $X^2$, $X^3$; X down: $\sqrt{X}$, $\log(X)$.]
4. Define elasticity and explain how you can use simple regression to estimate a value for elasticity.
(2 Points)
Elasticity: The ratio of the % change in Y (the response variable) to the % change in X (the
explanatory variable).
In a log-log regression, in which log(Y) is regressed on log(X), the slope coefficient gives an
estimate of the elasticity.
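To illustrate, here is a minimal sketch that fits a least-squares line to log(Y) versus log(X). The data are hypothetical, generated from Y = 100 * X^(-1.5), so the fitted slope should recover an elasticity of -1.5. The helper name `ols_slope_intercept` is an assumption made for this example.

```python
import math


def ols_slope_intercept(x, y):
    """Least-squares slope and intercept for a simple regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


# Hypothetical data generated from Y = 100 * X^(-1.5)
x_values = [1.0, 2.0, 4.0, 8.0]
y_values = [100.0 * x ** -1.5 for x in x_values]

# Regress log(Y) on log(X); the slope estimates the elasticity
log_x = [math.log(x) for x in x_values]
log_y = [math.log(y) for y in y_values]
elasticity, log_intercept = ols_slope_intercept(log_x, log_y)
print(f"estimated elasticity = {elasticity:.2f}")
```

With real data the points would not fall exactly on a line in log-log space, and the slope would be an estimate with its own standard error.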
5. Compare and contrast one-tailed versus two-tailed hypothesis tests. Use an example to
illustrate your answer. (3 Points)
One-sided hypothesis: Hypotheses in which the null hypothesis allows any value of a parameter
larger (or smaller) than a specified value.
$H_0: \mu \le \mu_0$
$H_A: \mu > \mu_0$
Two-sided hypothesis: Hypotheses in which the null hypothesis asserts a specific value for the
population parameter.
$H_0: \mu = \mu_0$
$H_A: \mu \ne \mu_0$
Graphically, the two-tailed test corresponds to the top t-test and includes the areas in both the left
and right tails of the distribution. The one-tailed test for the less-than null hypothesis is the middle
t-test and corresponds to the probability given in the right tail. The one-tailed test for the
greater-than null hypothesis corresponds to the probability below the t-value, in the left tail.
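The relationship between the three probabilities can be sketched as follows. This example assumes a large sample, so the standard normal distribution stands in for the t-distribution (the Python standard library has no t-distribution). The function name and the test statistic of 1.96 are hypothetical.

```python
from statistics import NormalDist


def tail_probabilities(test_statistic):
    """Return (lower-tail, upper-tail, two-tailed) p-values.

    Assumes a standard normal reference distribution as a large-sample
    stand-in for the t-distribution.
    """
    z = NormalDist()
    lower = z.cdf(test_statistic)        # area below the statistic (H_A: less than)
    upper = 1.0 - lower                  # area above the statistic (H_A: greater than)
    two_sided = 2.0 * min(lower, upper)  # both tails (H_A: not equal)
    return lower, upper, two_sided


lower, upper, two_sided = tail_probabilities(1.96)
print(f"lower tail = {lower:.3f}, upper tail = {upper:.3f}, two-tailed = {two_sided:.3f}")
```

The two-tailed p-value is twice the smaller one-tailed value, so a result that is significant at 5% in a one-tailed test may not be significant in the two-tailed version.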
Problems
Before purchasing videoconferencing equipment, a company ran tests of its current internal computer
network. The goal of the tests was to measure how rapidly data moved through the network given the
current demand on the network. Eighty files ranging in size from 20 to 100 megabytes (MB) were
transmitted over the network at various times of day, and the time to send the files was recorded. The
attached JMP output reports the results of a simple regression model with Y given by Transfer Time and
X given by File Size. Use the JMP output to answer the following questions.
1. Define, discuss, and compare the conditional and unconditional means of the response variable
Transfer Time. (3 Points)
Unconditional mean: The expected value or mean of a distribution without reference to any
possible explanatory or predictor variables: $E(Y) = \bar{Y} = 25$. No value for file size is
specified, so we use the simple average over the entire sample.
Conditional mean: The average or expected value of one variable given that another variable
takes on a specific value: $E(Y \mid X = x) = \beta_0 + \beta_1 x = 7.3 + 0.3x$. We condition our expected
value for Transfer Time on the file size.
2. Use the appropriate parts of the JMP output to discuss and illustrate how total variation relates
to statistical models. (3 Points)
The statistical model divides total variation into that which is explained by the model and that
which is not explained.
The total variation to be explained is the sum of the squares of the deviations from the mean, or
$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$
This is the variation around the unconditional mean. The unexplained variation is the sum of the
squares of the residuals, which is given by
$\sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i)^2 = \sum_{i=1}^{n} e_i^2$
This is the variation around the conditional mean. The square root of the average of this sum
(dividing by n - 2 in simple regression) is the Root Mean Squared Error (RMSE), or the standard
error of the regression, which is 6.24.
The R-squared tells us the percentage of total variation that is explained by the statistical model.
In this case, the statistical model explains approximately 62% of the variation. It then follows
that 38% of the variation is not explained by the model.
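The decomposition of total variation can be sketched in a few lines of Python. The five data points below are hypothetical and are not the exam's data, so the resulting R-squared and RMSE differ from the JMP output. All variable names are assumptions made for this example.

```python
import math

# Hypothetical file sizes (MB) and transfer times, not the exam's data
file_size = [20.0, 40.0, 60.0, 80.0, 100.0]
transfer_time = [14.0, 18.0, 27.0, 30.0, 39.0]
n = len(file_size)

# Least-squares fit of transfer time on file size
x_bar = sum(file_size) / n
y_bar = sum(transfer_time) / n
sxx = sum((x - x_bar) ** 2 for x in file_size)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(file_size, transfer_time))
slope = sxy / sxx
intercept = y_bar - slope * x_bar
fitted = [intercept + slope * x for x in file_size]

# Total variation: squared deviations around the unconditional mean
sst = sum((y - y_bar) ** 2 for y in transfer_time)
# Unexplained variation: squared residuals around the conditional mean
sse = sum((y - f) ** 2 for y, f in zip(transfer_time, fitted))
# Explained variation: squared deviations of fitted values from the mean
ssr = sum((f - y_bar) ** 2 for f in fitted)

r_squared = 1.0 - sse / sst
rmse = math.sqrt(sse / (n - 2))  # standard error of the regression

print(f"SST = {sst:.1f}, SSR = {ssr:.1f}, SSE = {sse:.1f}")
print(f"R-squared = {r_squared:.3f}, RMSE = {rmse:.3f}")
```

The explained and unexplained sums of squares add up to the total, which is why R-squared and the unexplained share always sum to 100%.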
3. Evaluate and explain the five conditions for the simple regression model (SRM) (5 Points)
a) Linear
The scatterplot suggests that the data conforms to a straight line.
b) No lurking variables
In the case of a simple regression model, there are almost always lurking variables. In this
case, the time of the day, the type of file, or a number of other possible explanatory
variables might also affect transfer time.
c) Equal variance
The residuals for small, medium, and large file sizes appear to be approximately the same
distance, on average, from the regression line.
d) Independence
In order to gauge independence, we need to have a scatterplot of the residuals. If the
scatterplot has no patterns and appears random, then we have independence among the
observations. Because we don’t have a scatterplot, we can’t judge whether we have
independence or not. Sorry about the omission.
e) Normal
The normal-quantile plot shows that most residuals locate along the straight line within the
confidence band. There are some deviations from the straight line in the middle of the
graph.
The skewness coefficient and the coefficient of excess aren’t extremely large.
The Shapiro-Wilk test fails to reject the hypothesis that the errors are normally distributed.
There is some concern about normality, so we continue cautiously under the assumption that
the residuals are normally distributed.
4. Use the information about the slope coefficient to explain:
a) How the t-ratio is calculated and interpreted. (1 Point)
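The t-ratio is calculated by dividing the estimated slope parameter by its standard error:
$t = \dfrac{\hat{\beta}_1}{se(\hat{\beta}_1)} = \dfrac{0.31}{0.03} \approx 11.4$
(The rounded values shown here give about 10.3; the reported 11.4 presumably reflects the
unrounded estimates in the output.) The t-ratio counts how many standard errors the estimate
lies from the hypothesized value of zero.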
b) Specify the hypothesis that is being tested. (1 Point)
$H_0: \beta_1 = 0$
c) Explain the results of the hypothesis test. (2 Points)
The slope coefficient is 11.4 standard errors from the hypothesized value. This gives the very
low p-value reported in the output. The low p-value means that we should reject the null
hypothesis.
5. Interpret the intercept and slope coefficients. Your answer should include a discussion of the
units in which they are measured. (2 Points)
The intercept term might be interpreted as the computer’s setup time before the actual file
transfer begins. This means that no matter what the file size, the computer takes
approximately 7 time units before the file transfer starts. The units of the intercept term are the
time units.
The slope coefficient of 0.31 means that increasing the file size by 1 megabyte increases the file
transfer time by 0.31 time units, on average. Since the time units aren’t specified, we can only
say that the units of the slope coefficient are time units per megabyte.
6. Use the regression output to define and discuss confidence intervals versus the prediction
intervals for the regression line. (3 Points)
As mentioned, the statistical model decomposes total variation into that which is explained by
the model and that which is unexplained. The unexplained part of the variation or the residuals
are therefore one source of uncertainty. The other sources of uncertainty occur because the
intercept and slope have been estimated, and both have standard errors.
As mentioned, the statistical model is a conditional expectation or mean. If we transfer multiple
files which are of the same size and then take the average of these transfers, this would
correspond to the conditional mean for a given file size. The confidence interval reflects the
variation that occurs from multiple transfers of the same file size. The standard errors of the
intercept and slope contribute to this uncertainty. Because we are taking an average, the
residuals cancel each other out, so the standard error of the regression isn’t added as a separate
term in this calculation.
If we don’t have the benefit of multiple file transfers and are only predicting a single occurrence
or one specific file of a given size, then the residuals can’t cancel each other out and more
uncertainty exists. This means that the standard errors of the intercept, slope, and regression all
are part of the uncertainty that is summarized in the prediction interval.
The addition of the standard error of the regression to the standard errors of the slope and
intercept causes the prediction interval to exceed the confidence interval in width. The
prediction interval accounts for all variation, both explained and unexplained.
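The difference in width can be sketched with the usual textbook formulas for simple regression: the standard error for the conditional mean is RMSE * sqrt(1/n + (x - x_bar)^2 / Sxx), and the standard error for a single prediction adds 1 under the square root. The data and the function name below are hypothetical and do not come from the JMP output.

```python
import math


def interval_standard_errors(x, y, x_new):
    """Return (fitted value, se for the mean, se for a prediction, rmse) at x_new."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    rmse = math.sqrt(sse / (n - 2))

    # Uncertainty from estimating the intercept and slope
    estimation_term = 1.0 / n + (x_new - x_bar) ** 2 / sxx
    se_mean = rmse * math.sqrt(estimation_term)
    # A single new observation also carries its own residual variation
    se_prediction = rmse * math.sqrt(1.0 + estimation_term)
    return intercept + slope * x_new, se_mean, se_prediction, rmse


# Hypothetical file sizes (MB) and transfer times, not the exam's data
file_size = [20.0, 40.0, 60.0, 80.0, 100.0]
transfer_time = [14.0, 18.0, 27.0, 30.0, 39.0]
fit, se_mean, se_prediction, rmse = interval_standard_errors(file_size, transfer_time, 60.0)
print(f"fitted = {fit:.1f}, se(mean) = {se_mean:.3f}, se(prediction) = {se_prediction:.3f}")
```

Each interval is the fitted value plus or minus a t-multiplier (with n - 2 degrees of freedom) times the corresponding standard error, so the prediction interval is always the wider of the two.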
7. Explain how you would use the estimated model to predict file transfer time. (1 Point)
We use the estimated equation and substitute in different file sizes to predict how long it will
take to transfer each file:
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X_i = 7.3 + 0.3 X_i$
For example, a 10 megabyte file should take 10.3 time units to transfer:
$10.3 = 7.3 + 0.3(10)$
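The substitution can be sketched as a small Python function. The coefficients 7.3 and 0.3 and the 20 to 100 MB range of tested files come from the answer key. The function name and the range flag are assumptions added for illustration.

```python
def predict_transfer_time(file_size_mb, intercept=7.3, slope=0.3):
    """Predicted transfer time (in the unspecified time units) for a given file size."""
    return intercept + slope * file_size_mb


def within_tested_range(file_size_mb, low=20.0, high=100.0):
    """True if the file size lies inside the range of sizes used in the tests."""
    return low <= file_size_mb <= high


for size in (10.0, 50.0, 100.0):
    time = predict_transfer_time(size)
    flag = "" if within_tested_range(size) else " (outside the tested range)"
    print(f"{size:.0f} MB -> {time:.1f} time units{flag}")
```

Predictions for sizes outside the range of the tested files rely on the linear pattern continuing to hold there.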