Definitions and Concepts

1. Define variation and statistical models and then explain how they are related. (3 Points)

Variation: Differences among individuals or items; also fluctuations over time.
Pattern: A systematic, predictable feature in data.
Statistical model: A breakdown of variation into a predictable pattern and the remaining variation.
The process of creating a statistical model decomposes the total variation into an explained (model) part and an unexplained part.

2. Compare and contrast ordinal and numerical variables. (3 Points)

Numerical variable: Column of values in a data table that records numerical properties of cases (also called continuous variable).
Ordinal variable: A categorical variable whose labels have a natural order.
When ordinal variables are coded as integers, they may appear to be numerical. We need to be careful, however, in a situation like a Likert scale, because a value of 2 compared to a value of 1 does not mean that category 2 is twice the size of category 1.

3. Define a probability distribution and then explain how mean and variance are related to the normal distribution. (3 Points)

A probability distribution is a function that assigns a probability to each possible value of a random variable. The normal distribution is an example of a probability distribution. The mean and variance are the two parameters that uniquely define the normal distribution.

4. Compare and contrast the Central Limit Theorem and the Law of Large Numbers. (3 Points)

Central Limit Theorem: The probability distribution of a sum of independent random variables of comparable variance tends to a normal distribution as the number of summed variables increases.
Law of Large Numbers: The relative frequency of an outcome converges to a number, the probability of the outcome, as the number of observed outcomes increases.
Both results depend on the sample size n growing large.
As the number of observations being summed increases, the distribution of their sum converges to the normal distribution no matter what the original distribution might have been. In the case of the Law of Large Numbers, as the sample size increases, the observed relative frequencies converge to the distribution from which the observations are being drawn.

5. Explain the calculation of the coefficient of variation (CV) and z-score. What do they share in common? (3 Points)

The coefficient of variation is a unitless measure that is calculated by dividing the sample standard deviation by the mean:

$$ \mathrm{CV} = \frac{s}{\bar{x}} $$

The z-score is also a unitless measure and is calculated by subtracting the mean and dividing by the standard deviation:

$$ z_i = \frac{x_i - \bar{x}}{s_x} $$

Both are unitless measures and therefore can be interpreted independently of the units of measure.

Problems

1. The cases that make up the dataset for this problem are the types of cars sold in the United States in 2011. The data include variables for city mpg, vehicle type, air aspiration, horsepower, and displacement of 319 vehicles.

a. Use all of the relevant statistics and diagrams to discuss the following characteristics of the distribution of city miles per gallon.

i. Location (3 Points)

Mean: 19.4 (triangle in the boxplot)
Median: 18 (middle vertical line in the boxplot)
Both are measures of central tendency.

ii. Scale (3 Points)

Std Dev: 5.9
IQR: 7 (length of the rectangle in the boxplot; the whiskers also depend on the IQR, since they extend out 1.5 times the IQR and then slide back until they reach an observation)
Both are measures of how spread out the distribution is.

iii. Symmetry (3 Points)

Skewness: 1.4 (positive, or right, skewness)
The median is less than the mean.
The right whisker is longer than the left whisker.
There are more large positive-valued outliers.
The normal-quantile plot indicates that the values at the right of the distribution are too large relative to the normal and those on the left are too small relative to the normal.

iv. Tail Thickness (3 Points)

Kurtosis (Coefficient of Excess in JMP): 3.9 (thick tails relative to the normal)
Outliers are present.
Some values are larger than would be expected relative to the normal.

v. Normality (3 Points)

The normal distribution has skewness and excess kurtosis both equal to zero. This distribution has values for both that are substantially greater than zero. The normal-quantile plot shows that on the left-hand side of the distribution the values are too small relative to the normal, and on the right-hand side they are too large. The very small p-value for the Shapiro-Wilk test allows us to reject the null hypothesis that the distribution is normal.

b. Categorical Associations

i. Define and explain the marginal distribution for Vehicle Type. (3 Points)

The marginal distribution for the vehicle type is the last column in the contingency table. This gives the relative frequency of each type of vehicle:
Both: 10.4%
Car: 61.8%
Truck: 25.8%

ii. Define and explain the conditional distribution for Trucks. (3 Points)

This is the fourth row (the Truck row) in the contingency table. It gives the probabilities for the type of aspiration given that we know the vehicle is a truck:
Natural: 89.0%
Supercharged: 1.2%
Turbocharged: 9.8%

iii. Evaluate the relationship between Vehicle Type and Air Aspiration. (3 Points)

If the type of vehicle and the type of aspiration were independent, then there would be no advantage or additional insight gained from predicting air aspiration given the type of vehicle, or vehicle type given the type of aspiration. However, there is an advantage to knowing the type of vehicle. For example, if we know that it is a truck, it is much more likely to have natural aspiration than a car. Or if we know that it is turbocharged, then it is more likely to be a car than a truck. If there were no relationship between vehicle type and air aspiration, then the lines in the mosaic chart would be parallel.
Since they are not, this is evidence that there is a relationship between type of vehicle and air aspiration. The low p-value associated with the contingency table also allows us to reject the idea that there is no relationship.

c. Quantitative Associations

i. Explain the calculation and interpretation of the covariance. Why is this measure not very useful? (3 points)

The covariance is the sum of the products of the deviations from the means, divided by n - 1. This procedure retains the units of measure for each of the variables. The covariance is calculated as

$$ \mathrm{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} $$

When observations are both above their means or both below their means simultaneously, we have a positive association. When x is above its mean and y below its mean, or x below its mean and y above its mean, we have a negative association. Because the units are retained (the covariance is measured in the units of x times the units of y), its magnitude is hard to interpret, so this isn’t a very useful statistic.

ii. Explain the calculation and interpretation of the correlation coefficient. Why is the measure much more useful? (3 points)

If we standardize both x and y, then we subtract the appropriate means and divide by the standard deviations. The result is very similar to the covariance formula:

$$ r = \frac{\sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s_x} \right) \left( \frac{y_i - \bar{y}}{s_y} \right)}{n - 1} $$

Each term in the sum is the product of two z-scores. Because z-scores don’t have any units, the correlation coefficient is unitless. This gives it the advantage over the covariance that it can be more easily interpreted.

iii. Discuss the direction, linearity, variance, and outliers of the relationship between horsepower and displacement. (4 points)

Direction: positive
Linearity: a linear relationship seems appropriate
Variance: some increasing variation
Outliers: some of the observations are distant from the line in the graph and are potential outliers

2. A kicker in a US football game scores 3 points for a made field goal. A kicker hits 60% of the attempts taken in a game.
n = 5, p = 60%

Field Goals   Probability   Cumulative Probability
0             0.01024       0.01024
1             0.0768        0.08704
2             0.2304        0.31744
3             0.3456        0.66304
4             0.2592        0.92224
5             0.07776       1

[Figure: bar chart of the probability of each number of made field goals, 0 to 5; vertical axis: Probability.]

a. Explain why it might be appropriate to use a binomial distribution for this problem. (2 points)

There are only two outcomes for each kick, so each kick is a Bernoulli trial. The kicker will either make the field goal or miss it. A binomial random variable counts the number of successes in a fixed number of Bernoulli trials.

b. What assumptions are needed in order to model the number of successful kicks in a game as a binomial random variable? (2 points)

The kicks must be independent of one another, and the probability of success must be the same (60%) on every kick. The outcomes might not be independent if there was a lot of pressure in a given situation or if the kicker had succeeded or failed in previous kicks.

c. If the kicker tries five attempts during a game, how many points would you expect him to contribute to the team's score? Explain your calculation. (3 points)

The expected value for 5 kicks is the sum of the products of the possible outcomes and their associated probabilities. Formally, using the rounded probabilities, this would be:

$$ E(X) = 0(0.01) + 1(0.08) + 2(0.23) + 3(0.35) + 4(0.26) + 5(0.08) \approx 3 $$

This could also be calculated as

$$ E(X) = n p = 5 \times 0.6 = 3 $$

Since each field goal is worth 3 points, the expected number of points would be 3 × 3 = 9.

d. Explain two different ways you could calculate the standard deviation of the number of successful field goals scored by the kicker. (3 Points)

The standard deviation could be calculated as the square root of the sum of the squared deviations from the mean, each multiplied by its respective probability:

$$ \sigma = \left[ (0-3)^2(0.01) + (1-3)^2(0.08) + (2-3)^2(0.23) + (3-3)^2(0.35) + (4-3)^2(0.26) + (5-3)^2(0.08) \right]^{1/2} \approx 1.1 $$

The formula for the standard deviation of a binomial gives the same result:

$$ \sigma = \sqrt{n p (1 - p)} = \sqrt{5 \times 0.6 \times (1 - 0.6)} \approx 1.1 $$

e. Why would the standard deviation of the number of points be useful to the coach? (2 points)

The standard deviation tells the coach how erratic the kicker might be. It is a measurement of the riskiness of attempting a field goal.
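
The expected value and standard deviation in parts c and d can be checked numerically. The following is a minimal Python sketch, assuming only the values stated in the problem (5 attempts, a 60% success rate, 3 points per field goal). The function name `binomial_pmf` and the variable names are illustrative and not part of the original answer key. The final two lines convert the results to points, which relies on the standard fact that multiplying a random variable by a constant multiplies its mean and its standard deviation by that constant.

```python
from math import comb, sqrt


def binomial_pmf(n: int, p: float) -> list[float]:
    """Return P(X = k) for k = 0..n when X ~ Binomial(n, p)."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


n, p = 5, 0.6        # five attempts, 60% success rate (from the problem)
points_per_goal = 3  # a made field goal is worth 3 points

pmf = binomial_pmf(n, p)

# Way 1: work from the definition, weighting each outcome by its probability.
mean_def = sum(k * prob for k, prob in enumerate(pmf))
sd_def = sqrt(sum((k - mean_def) ** 2 * prob for k, prob in enumerate(pmf)))

# Way 2: use the binomial shortcut formulas n*p and sqrt(n*p*(1-p)).
mean_formula = n * p
sd_formula = sqrt(n * p * (1 - p))

# Scale from field goals to points (assumption: points = 3 * field goals).
expected_points = points_per_goal * mean_formula
sd_points = points_per_goal * sd_formula

for k, prob in enumerate(pmf):
    print(f"P(X = {k}) = {prob:.5f}")
print(f"mean: {mean_def:.3f} (definition), {mean_formula:.3f} (formula)")
print(f"sd:   {sd_def:.3f} (definition), {sd_formula:.3f} (formula)")
print(f"expected points: {expected_points:.1f}, sd of points: {sd_points:.2f}")
```

Both routes agree because the shortcut formulas are derived from the same definitions. The definition-based route uses the unrounded probabilities, so it avoids the small rounding error that appears when the two-decimal probabilities are used by hand.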