Midterm Key - Marriott School

Definitions and Concepts
1. Define variation and statistical models and then explain how they are related. (3 Points)
Variation: Differences among individuals or items; also fluctuations over time.
Pattern: A systematic, predictable feature in data.
Statistical model: A breakdown of variation into a predictable pattern and the remaining variation.
The process of creating a statistical model decomposes the total variation into explained (model)
and unexplained categories.
2. Compare and contrast ordinal and numerical variables. (3 Points)
Numerical variable: Column of values in a data table that records numerical properties of cases
(also called continuous variable).
Ordinal variable: A categorical variable whose labels have a natural order.
When ordinal variables are coded as integers, they may appear to be numerical. We need to
be careful, however, in a situation like a Likert scale, because a value of 2 compared to 1 does not
necessarily mean that category 2 is twice the size of category 1.
3. Define a probability distribution and then explain how mean and variance are related to the
normal distribution. (3 Points)
A probability distribution is a function that assigns a probability to each possible value of a random
variable.
The normal distribution is an example of a probability distribution.
The mean $\mu$ and variance $\sigma^2$ are the two parameters that uniquely define the normal
distribution.
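As a reminder of why these two parameters are enough, the normal density can be written so that $\mu$ and $\sigma^2$ are the only unknowns. The symbol $f(x)$ for the density is notation introduced here, not taken from the key:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(x - \mu)^2}{2\sigma^2} \right)
```

Here $\mu$ sets the center of the curve and $\sigma^2$ sets its spread.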
4. Compare and contrast the Central Limit Theorem and the Law of Large Numbers. (3 Points)
Central Limit Theorem: The probability distribution of a sum of independent random variables of
comparable variance tends to a normal distribution as the number of summed variables increases.
Law of Large Numbers: The relative frequency of an outcome converges to a number, the
probability of the outcome, as the number of observed outcomes increases.
Both depend on n getting larger. Under the Central Limit Theorem, as the number of
observations being summed increases, the distribution of the sum converges to the normal
distribution no matter what the original distribution might have been. Under the Law of Large
Numbers, as the sample size increases, the observed relative frequencies converge to the
probabilities of the distribution from which the observations are being drawn.
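The contrast can be illustrated with a small simulation. This is only a sketch under assumed settings: the success probability of 0.6, the use of uniform(0, 1) draws, the sample sizes, and the function names are all choices made for illustration.

```python
import random
import statistics

def relative_frequency(n, p, rng):
    """Share of successes in n independent trials with success probability p."""
    return sum(rng.random() < p for _ in range(n)) / n

def share_within_one_sd(k, reps, rng):
    """Simulate `reps` sums of k uniform(0, 1) draws and return the share of
    sums that lie within one standard deviation of their mean."""
    sums = [sum(rng.random() for _ in range(k)) for _ in range(reps)]
    m = statistics.fmean(sums)
    s = statistics.stdev(sums)
    return sum(abs(x - m) <= s for x in sums) / reps

rng = random.Random(2011)  # fixed seed (arbitrary) so runs are repeatable

# Law of Large Numbers: the relative frequency settles near p = 0.6 as n grows.
for n in (100, 10_000, 100_000):
    print(n, round(relative_frequency(n, 0.6, rng), 4))

# Central Limit Theorem: sums of 30 uniform draws behave like a normal variable,
# for which about 68% of values fall within one standard deviation of the mean.
print(round(share_within_one_sd(30, 20_000, rng), 3))
```

The first loop concerns a single relative frequency, while the second concerns the shape of the distribution of a sum.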
5. Explain the calculation of the coefficient of variation (CV) and z-score. What do they share in
common? (3 Points)
The coefficient of variation is a unitless measure that is calculated by dividing the sample standard
deviation by the mean:
$CV = s / \bar{x}$
The z-score is also a unitless measure and is calculated by subtracting the mean
and dividing by the standard deviation (error):
$z_i = (X_i - \bar{X}) / s_X$
Both are unitless measures and therefore can be interpreted independently of the units of
measure.
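A short sketch can confirm that both quantities are unitless. The mpg values below are hypothetical, and the conversion factor to kilometres per litre is approximate; neither comes from the exam data.

```python
import statistics

mpg = [15.0, 18.0, 18.0, 21.0, 28.0]   # hypothetical city mpg values
kpl = [x * 0.425144 for x in mpg]      # same cars in km per litre (approximate factor)

def coefficient_of_variation(xs):
    # Sample standard deviation (n - 1 denominator) divided by the mean.
    return statistics.stdev(xs) / statistics.fmean(xs)

def z_scores(xs):
    # Subtract the mean, then divide by the sample standard deviation.
    m = statistics.fmean(xs)
    s = statistics.stdev(xs)
    return [(x - m) / s for x in xs]

cv_mpg, cv_kpl = coefficient_of_variation(mpg), coefficient_of_variation(kpl)
z_mpg, z_kpl = z_scores(mpg), z_scores(kpl)

print(round(cv_mpg, 4), round(cv_kpl, 4))   # the two values agree
print([round(z, 3) for z in z_mpg])
print([round(z, 3) for z in z_kpl])         # same z-scores in either unit
```

Changing the unit rescales the mean and the standard deviation by the same factor, so the factor cancels in both ratios.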
Problems
1. The cases that make up the dataset for this problem are the types of cars sold in the United
States in 2011. The data include variables for city mpg, vehicle type, air aspiration, horsepower,
and displacement of 319 vehicles.
a. Use all of the relevant statistics and diagrams to discuss the following characteristics of
the distribution of city miles per gallon.
i. Location (3 Points)
Mean: 19.4 (Triangle in boxplot)
Median: 18 (middle vertical line in the boxplot)
Measure of central tendency.
ii. Scale (3 Points)
Std Dev: 5.9
IQR: 7 (length of the rectangle in the boxplot).
Whiskers also depend on the IQR, since they extend out 1.5 times the IQR and
then slide back until they reach an observation.
Measure of how spread out the distribution might be.
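The whisker rule can be sketched in a few lines. The data values are hypothetical, and the "inclusive" quartile method is an assumption: statistical packages use slightly different quartile conventions, so a package such as JMP may report somewhat different quartiles for the same data.

```python
import statistics

# Hypothetical sample, not the exam's 319 vehicles.
data = sorted([12, 14, 15, 16, 17, 18, 18, 19, 21, 22, 24, 35, 41])

q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

# Fences sit 1.5 IQRs beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Each whisker "slides back" to the most extreme observation inside its fence.
lower_whisker = min(x for x in data if x >= lower_fence)
upper_whisker = max(x for x in data if x <= upper_fence)

# Anything outside the fences is plotted as a potential outlier.
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, median, q3, iqr)
print(lower_whisker, upper_whisker, outliers)
```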
iii. Symmetry (3 Points)
Skewness: 1.4 (positive or right skewness)
Median is less than mean
Longer right whisker than left whisker
More large, positive-valued outliers
Normal-quantile plot indicates that the values at the right of the distribution are
too large relative to the normal and those on the left are too small relative to
the normal.
iv. Tail Thickness (3 Points)
Kurtosis (Coefficient of Excess in JMP): 3.9 (thick tails relative to the normal)
Presence of outliers
Some values are larger than would be expected relative to the normal
v. Normality (3 Points)
The normal distribution has skewness and excess kurtosis both equal to zero. This
distribution has values that are significantly bigger than zero.
The normal-quantile plot shows that on the left-hand side of the distribution, the
values are too small relative to the normal and on the right-hand side they are
too large.
The very small p-value for the Shapiro-Wilk test allows us to reject the null
hypothesis that the distribution is normal.
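As a rough illustration of the two shape statistics, the sketch below uses plain moment-based formulas. That choice is an assumption: software packages often apply small-sample adjustments, so their reported values may differ slightly. Both data sets are hypothetical.

```python
import statistics

def central_moment(xs, order):
    # Average of the deviations from the mean raised to the given power.
    m = statistics.fmean(xs)
    return sum((x - m) ** order for x in xs) / len(xs)

def skewness(xs):
    # Third central moment scaled by the variance to the power 3/2.
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # Fourth central moment scaled by the squared variance, minus 3,
    # so that a normal distribution scores zero.
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

symmetric = [1, 2, 3, 4, 5]            # hypothetical, perfectly symmetric
right_skewed = [1, 1, 2, 2, 3, 10]     # hypothetical, one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

A positive skewness together with a positive excess kurtosis matches the pattern described for city mpg: a long right tail with more extreme values than the normal would produce.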
b. Categorical Associations
i. Define and explain the marginal distribution for Vehicle Type. (3 Points)
The marginal distribution for the vehicle type is the last column in the
contingency table. This gives the relative frequency of each type of vehicle:
Both: 10.4%
Car: 61.8%
Truck: 25.8%
ii. Define and explain the conditional distribution for Trucks. (3 Points)
This is the fourth row (the Trucks row) in the contingency table and gives the
probabilities for the type of aspiration given that we know that it is a truck:
Natural: 89.0%
Supercharged: 1.2%
Turbocharged: 9.8%
iii. Evaluate the relationship between Vehicle Type and Air Aspiration. (3 Points)
If the type of vehicle and type of aspiration are independent, then there would be
no advantage or additional insight that would come from predicting air
aspiration given the type of vehicle or vehicle given the type of aspiration.
However, there is an advantage to knowing the type of vehicle. For example, if
we know that it is a truck, it is much more likely to have natural aspiration than
a car. Or if we know that it is turbocharged, then it is more likely to be a car than
a truck.
If there was no relationship between vehicle type and air aspiration, then the
lines in the mosaic chart would be parallel. Since they are not, this is evidence
that there is a relationship between type of vehicle and air aspiration.
The low p-value associated with the contingency table also allows us to reject
the idea that there is no relationship.
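The marginal distribution, the conditional distributions, and the independence check can be sketched with a small table. The counts below are hypothetical round numbers chosen for illustration; they are not the exam's contingency table, and the "Both" category is left out to keep the sketch short.

```python
# Hypothetical counts: rows are vehicle types, columns are aspiration types.
counts = {
    "Car":   {"Natural": 140, "Supercharged": 10, "Turbocharged": 50},
    "Truck": {"Natural": 90,  "Supercharged": 1,  "Turbocharged": 9},
}

row_totals = {v: sum(row.values()) for v, row in counts.items()}
col_totals = {a: sum(counts[v][a] for v in counts) for a in counts["Car"]}
grand_total = sum(row_totals.values())

# Marginal distribution of vehicle type: row totals over the grand total.
marginal_vehicle = {v: t / grand_total for v, t in row_totals.items()}

# Conditional distribution of aspiration given vehicle type: each row over its total.
conditional = {
    v: {a: c / row_totals[v] for a, c in row.items()}
    for v, row in counts.items()
}

# Chi-square statistic: compare observed counts with the counts expected
# if vehicle type and aspiration were independent.
chi_square = 0.0
for v in counts:
    for a in col_totals:
        expected = row_totals[v] * col_totals[a] / grand_total
        chi_square += (counts[v][a] - expected) ** 2 / expected

print(marginal_vehicle)
print(conditional["Truck"])
# With 2 degrees of freedom, the 5% critical value is about 5.99.
print(round(chi_square, 2))
```

If the two conditional rows were identical, the statistic would be zero, which corresponds to parallel lines in the mosaic chart.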
c. Quantitative Associations
i. Explain the calculation and interpretation of the covariance. Why is this
measure not very useful? (3 points)
The covariance is the sum of the products of the deviations from the means, divided by n - 1.
This procedure retains the units of measure for each of the variables. The
covariance is calculated as
$\operatorname{cov}(x, y) = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
When observations are both above the means or below the means
simultaneously, then we have a positive association. When x is above the mean
and y below the mean, or x below the mean and y above the mean, then we
have a negative association. Because the units are retained, this isn’t a very
useful statistic.
ii. Explain the calculation and interpretation of the correlation coefficient. Why is
the measure much more useful? (3 points)
If we standardize both x and y, then we subtract the appropriate means and
divide by the standard deviations. This is very similar to the covariance formula:
$r = \frac{\sum_i \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1}$
Each term in the sum is the product of two z-scores. Because z-scores don't have any
units, the correlation coefficient is unitless. This gives it the advantage
over the covariance that it can be more easily interpreted.
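A small sketch makes the unit issue concrete. The displacement and horsepower values are hypothetical, and the helper names are introduced here for illustration only.

```python
import statistics

# Hypothetical engines: displacement in litres and horsepower.
displacement_l = [1.6, 2.0, 2.5, 3.5, 5.0]
horsepower = [120, 150, 180, 270, 400]

def covariance(xs, ys):
    # Sum of products of deviations from the means, divided by n - 1.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    # Sum of products of z-scores, divided by n - 1.
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    return sum(((x - mx) / sx) * ((y - my) / sy)
               for x, y in zip(xs, ys)) / (len(xs) - 1)

# The same engines with displacement measured in cubic centimetres.
displacement_cc = [x * 1000 for x in displacement_l]

cov_l = covariance(displacement_l, horsepower)
cov_cc = covariance(displacement_cc, horsepower)
r_l = correlation(displacement_l, horsepower)
r_cc = correlation(displacement_cc, horsepower)

print(cov_l, cov_cc)   # covariance grows by a factor of 1000
print(r_l, r_cc)       # correlation is unchanged by the change of unit
```

The covariance depends on the units chosen, while the correlation always stays between -1 and 1.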
iii. Discuss the direction, linearity, variance, and outliers of the relationship
between horsepower and displacement. (4 points)
direction: positive
linear: seems appropriate
variance: some increasing variation
outliers: some of the observations are distant from the line in the graph and are
potential outliers
2. A kicker in a US football game scores 3 points for a made field goal. A kicker hits 60% of the
attempts taken in a game.
N = 5, p = 60%

Field Goals   Probability   Cumulative Probability
0             0.01024       0.01024
1             0.0768        0.08704
2             0.2304        0.31744
3             0.3456        0.66304
4             0.2592        0.92224
5             0.07776       1

[Figure: plot of Probability (0 to 0.4) against number of field goals (0 to 5)]
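The probabilities listed for N = 5 and p = 60% can be reproduced from the binomial formula. The function name `binomial_pmf` is introduced here for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    # Number of ways to choose which k kicks succeed, times the
    # probability of any one such sequence of makes and misses.
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 5, 0.6
probabilities = [binomial_pmf(k, n, p) for k in range(n + 1)]

cumulative = 0.0
print("Field Goals  Probability  Cumulative")
for k, prob in enumerate(probabilities):
    cumulative += prob
    print(f"{k:<12} {prob:<12.5f} {cumulative:.5f}")
```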
a. Explain why it might be appropriate to use a binomial distribution for this problem. (2
points)
There are only two outcomes for each kick, so each kick is a Bernoulli trial: the kicker will
either make the field goal or miss it. A binomial random variable is the number of successes in a
fixed number of Bernoulli trials.
b. What assumptions are needed in order to model the number of successful kicks in a
game as a binomial random variable? (2 points)
The kicks need to be independent, with the same probability of success on every attempt.
The outcomes might not be independent if there was a lot of pressure in a given
situation or if the kicker had succeeded or failed in previous kicks.
c. If the kicker tries five attempts during a game, how many points would you expect him
to contribute to the team's score? Explain your calculation. (3 points)
The expected value for 5 kicks is the sum of the possible outcomes multiplied by their
associated probabilities. Formally this would be:
$0(.01) + 1(.08) + 2(.23) + 3(.35) + 4(.26) + 5(.08) \approx 3$
This could also be calculated as $n \times p = 5 \times 0.6 = 3$. Since each field goal is worth 3
points, the expected number of points would be 9.
d. Explain two different ways you could calculate the standard deviation of the number of
successful field goals scored by the kicker? (3 Points)
The standard deviation could be calculated as the square root of the sum of the squared
deviations multiplied by their respective probabilities:
$\left[ (0-3)^2(.01) + (1-3)^2(.08) + (2-3)^2(.23) + (3-3)^2(.35) + (4-3)^2(.26) + (5-3)^2(.08) \right]^{1/2} \approx 1.1$
The formula for the standard deviation of a binomial is
$\sqrt{n \times p \times (1 - p)} = \sqrt{5 \times 0.6 \times (1 - 0.6)} \approx 1.1$
e. Why would the standard deviation of the number of points be useful to the coach? (2
points)
The standard deviation tells the coach how erratic the kicker might be. It is a
measurement of the riskiness of attempting a field goal.
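The calculations in parts c through e can be checked with a short sketch. It uses the exact binomial probabilities rather than the rounded ones, so the two routes agree exactly; the variable names are introduced here for illustration.

```python
from math import comb, sqrt

n, p = 5, 0.6
points_per_goal = 3

# Exact binomial probabilities for 0 through 5 made field goals.
pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

# Route 1: work directly from the probability table.
mean_from_table = sum(k * prob for k, prob in enumerate(pmf))
sd_from_table = sqrt(sum((k - mean_from_table) ** 2 * prob
                         for k, prob in enumerate(pmf)))

# Route 2: use the binomial shortcut formulas.
mean_from_formula = n * p
sd_from_formula = sqrt(n * p * (1 - p))

# Points are 3 times the number of field goals, so the mean and the
# standard deviation both scale by 3.
expected_points = points_per_goal * mean_from_formula
sd_points = points_per_goal * sd_from_formula

print(round(mean_from_table, 4), round(mean_from_formula, 4))
print(round(sd_from_table, 4), round(sd_from_formula, 4))
print(round(expected_points, 4), round(sd_points, 4))
```

The standard deviation of the points is 3 times the standard deviation of the number of field goals, roughly 3.3 points around an expected 9.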