Download Quiz 9 - UF-Stat

Practice Questions for Exam 1
1. A used car lot evaluates their cars on a number of features as they arrive in the lot in
order to determine their worth. Among the features looked at are miles per gallon
(MPG), make and model (e.g., Toyota Camry), number of cylinders, horsepower, weight,
and year made. List these variables and state whether each is quantitative or categorical.
Quantitative: MPG, # of cylinders, horsepower, weight, year
Categorical: make, model
2. High temperatures for 35 major US cities were collected for January 23, 2006 and
were put into the stem plot below with leaf unit=1.0.
Depth  Stem | Leaves
   1     3  | 4
   5     3  | 6689
  11     4  | 001122
  17     4  | 566678
  (3)    5  | 113        <- 18th position is 51
  15     5  | 678899
   9     6  | 04
   7     6  | 8
   6     7  | 12
   4     7  | 889
   1     8  | 3
a) What is the minimum, median and maximum for this dataset?
min: 34 x 1.0 = 34
position of median: (n + 1)/2 = (35 + 1)/2 = 18
M = 51 x 1.0 = 51
max: 83 x 1.0 = 83
b) Find the range for this dataset.
Range = max – min
83 – 34 = 49
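As a quick check of question 2, the stem plot can be expanded back into its 35 values and summarized. The following is a minimal Python sketch; the names `rows`, `temps`, and `summary` are illustrative and not part of the original solutions.

```python
from statistics import median

# Stem plot rows from question 2 (leaf unit = 1.0), as (stem, leaves) pairs in order.
rows = [
    (3, "4"), (3, "6689"), (4, "001122"), (4, "566678"), (5, "113"),
    (5, "678899"), (6, "04"), (6, "8"), (7, "12"), (7, "889"), (8, "3"),
]

# Rebuild each observation as 10 * stem + leaf.
temps = [10 * stem + int(leaf) for stem, leaves in rows for leaf in leaves]

n = len(temps)
median_position = (n + 1) // 2  # 1-based position of the median when n is odd

summary = {
    "min": min(temps),
    "median": median(temps),
    "max": max(temps),
    "range": max(temps) - min(temps),
}
print(n, median_position, summary)
```

The printed summary should agree with the minimum, median, maximum, and range found by hand above.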
3. Test scores for a class of 15 economics students were as follows:
86 95 78 93 34 58 65 68 72 98 92 84 73 84 91
Sorted: 34 58 65 68 72 73 78 84 84 86 91 92 93 95 98 (M is the 8th value, 84)
a) Find the mean, median, and mode.
x̄ = 1171/15 = 78.07
Pos. med. = (15 + 1)/2 = 8
M = 84
Mode = 84
b) Find the IQR, range, standard deviation and variance.
IQR = Q3 - Q1 = 92 – 68 = 24
range: 98 – 34 = 64
s = 17.07
s² = 291.50
c) Create a stem and leaf plot.
3 | 4
4 |
5 | 8
6 | 58
7 | 238
8 | 446
9 | 12358
d.) Create a boxplot.
[Boxplot drawn on an axis from 30 to 100; figure not reproduced.]
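The summary statistics for question 3 can be verified with Python's standard `statistics` module. This is a sketch, not part of the original answer key: the helper `quartiles` is a hypothetical name, and it assumes the textbook convention that Q1 and Q3 are the medians of the lower and upper halves of the sorted data, with the overall median left out when n is odd.

```python
from statistics import mean, median, mode, stdev, variance

scores = [86, 95, 78, 93, 34, 58, 65, 68, 72, 98, 92, 84, 73, 84, 91]

def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    The overall median is excluded from both halves when n is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[-half:])

q1, q3 = quartiles(scores)

print("mean    ", round(mean(scores), 2))
print("median  ", median(scores))
print("mode    ", mode(scores))
print("IQR     ", q3 - q1)
print("range   ", max(scores) - min(scores))
print("s       ", round(stdev(scores), 2))      # sample standard deviation (n - 1)
print("variance", round(variance(scores), 2))   # sample variance (n - 1)
```

Note that squaring a rounded standard deviation (17.07² ≈ 291.38) gives a slightly smaller figure than the variance computed at full precision.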
4. As they left the movie theater in Gainesville, 17 people were asked how long they had
to wait in line for their tickets.
a.) What is the population of interest?
All Gainesville residents that go to the movie theater
b.) What is the sample?
17 people
c.) What is the variable being measured?
Time spent in line
d.) Is this variable discrete quantitative, continuous quantitative or categorical?
continuous quantitative
The results of the above question were as follows:
[Histogram of Minutes: x-axis Minutes (0 to 18), y-axis Frequency (0 to 6); figure not reproduced.]
e) Describe the shape, center, and spread of this histogram.
Shape: Skewed right
Center: 4 to 6
Spread: 0 to 18
f) Are there any outliers?
The value at 18 might be considered a moderate outlier.
5. Last year a small accounting firm paid each of its five clerks $22,000, two junior
accountants $50,000 each, and the firm’s owner $270,000.
a) What is the mean salary paid at the firm?
x̄ = 480,000/8 = 60,000
b) How many employees earn less than the mean?
7 employees earn less than the mean
c) What is the median salary?
22,000 22,000 22,000 22,000 22,000 50,000 50,000 270,000
M = (22,000 + 22,000)/2 = 22,000
d) What does this tell us about the mean and the median?
The mean is affected by outliers, but the median is not.
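The effect of the owner's salary on the mean can be seen directly. The sketch below recomputes the mean and median for question 5, then repeats the calculation without the largest salary; that second step is an illustrative addition and not part of the original solution.

```python
from statistics import mean, median

# Five clerks, two junior accountants, and the owner.
salaries = [22_000] * 5 + [50_000] * 2 + [270_000]

avg = mean(salaries)
mid = median(salaries)
below_mean = sum(1 for s in salaries if s < avg)

# Illustrative extra step: drop the owner's salary and recompute.
without_owner = salaries[:-1]
avg_without = mean(without_owner)
mid_without = median(without_owner)

print(avg, mid, below_mean)
print(avg_without, mid_without)
```

Removing one extreme value halves the mean while the median stays where it was, which is the point of part d).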
6. Two students took the same English course. Their grades were based on five
compositions. The grades are as follows.
            Comp. 1   Comp. 2   Comp. 3   Comp. 4   Comp. 5
Student A     78        52        84        95        92
Student B     84        63        80        92        89
a) Find the mean and standard deviation for each student.
Student A: x̄ = 80.2, s = 17.12
Student B: x̄ = 81.6, s = 11.4
b) Which student has more variability in their scores?
Student A, because the standard deviation is higher.
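A short check of question 6, assuming the sample standard deviation (dividing by n - 1) is the one intended; the dictionary name `grades` is illustrative.

```python
from statistics import mean, stdev

grades = {
    "Student A": [78, 52, 84, 95, 92],
    "Student B": [84, 63, 80, 92, 89],
}

# Mean and sample standard deviation (n - 1 in the denominator) per student.
results = {
    name: (round(mean(marks), 1), round(stdev(marks), 2))
    for name, marks in grades.items()
}

for name, (avg, sd) in results.items():
    print(f"{name}: mean = {avg}, s = {sd}")

# The student with the larger standard deviation has more variable scores.
most_variable = max(results, key=lambda name: results[name][1])
print(most_variable)
```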
7. The following are the golf scores of 12 members of a women’s golf team in
tournament play:
89 90 87 95 86 81 102 105 83 88 91 79
a) Display the distribution by a stemplot and describe its main features.
 7 | 9
 8 | 136789
 9 | 015
10 | 25
Center – upper 80’s
Spread – 79 to 105
Shape – bell
No outliers
b) Compute the mean, variance and standard deviation of these golf scores.
Mean: x̄ = 89.67
Standard deviation: s = 7.83
Variance: s² = 61.33
c) Then compute the median, the quartiles and the IQR.
Sorted: 79 81 83 86 87 88 89 90 91 95 102 105
Q1: (83 + 86)/2 = 84.5
M: (88 + 89)/2 = 88.5
Q3: (91 + 95)/2 = 93
IQR: Q3 - Q1 = 93 - 84.5 = 8.5
d) Are there any outliers?
No
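The "no outliers" answer can be checked with the common 1.5 x IQR rule. The answer key does not say which rule it used, so treat the rule and the names `lower_fence` and `upper_fence` as assumptions of this sketch.

```python
from statistics import median

scores = [89, 90, 87, 95, 86, 81, 102, 105, 83, 88, 91, 79]

data = sorted(scores)
half = len(data) // 2

# Quartiles as medians of the lower and upper halves of the sorted data.
q1 = median(data[:half])
q3 = median(data[-half:])
iqr = q3 - q1

# Assumed rule: a value is a potential outlier if it lies more than
# 1.5 * IQR below Q1 or above Q3.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(q1, q3, iqr)
print(lower_fence, upper_fence, outliers)
```

Under this rule the largest score, 105, falls just inside the upper fence, so it is not flagged.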
8. Colleges and universities are requiring an increasing amount of information about
applicants before making acceptance and financial aid decisions. Classify each of the
following types of data required on a college application as discrete quantitative,
continuous quantitative or categorical.
a) High School GPA
Continuous quantitative
b) Gender of applicant
Categorical
c) Parent’s income
Continuous quantitative
d) High School class rank
Discrete quantitative
9. Answer the following questions:
a) What is the primary disadvantage of using the range to compare the variability of
data sets?
The range is heavily influenced by outliers
b) Can the variance of a data set ever be negative?
No
c) The variable of interest is height which is measured in inches. What is the unit of
the standard deviation? What is the unit of the variance?
The unit of the standard deviation is inches.
The unit of the variance is squared inches.
d) Give an example of a dataset where the standard deviation equals 0.
Any dataset that has all of the same numbers
Example: 5 5 5 5 5
10. Describe the following scatterplots.
a) Elevation (in meters) versus mean annual temperature (in Centigrade).
Negative
Linear
No outliers
Strong
b). Year vs. Gas Price Average from 1976 to 2004.
[Scatterplot: Avg. Gas Price (0.50 to 2.00) on the y-axis, Year (1980 to 2005) on the x-axis; figure not reproduced.]
Positive
Linear
No outliers
Moderate
c). Age (in months) vs. Score on a Cognitive Abilities Test
[Scatterplot: Age (20 to 36) on the y-axis, Score (15 to 35) on the x-axis; figure not reproduced.]
Positive
Linear
2 potential outliers
Strong
11. A least squares regression was fit to the data shown above of Year vs. Gas Prices.
The result was the following:
Fitted Line Plot
Avg. Gas Price = - 44.78 + 0.02310 Year
(y-intercept = -44.78, slope = 0.02310)
S = 0.210031, R-Sq = 47.3%, R-Sq(adj) = 45.3%
[Plot: Avg. Gas Price (0.50 to 2.00) vs. Year (1980 to 2005) with the fitted line; figure not reproduced.]
a). Identify the explanatory and response variables.
explanatory(x) = year
response(y) = avg. gas price
b). What is R² and what is its interpretation?
R² = 47.3%: 47.3% of the variation in gas price is explained by year.
c). What is the slope and what is its interpretation?
Slope = 0.02310: on average, the gas price increases by 0.02310 per year.
d). What is the y-intercept and what is its interpretation?
y-intercept = -44.78. Do not interpret it, because there is no data near x = 0.
e). What would you predict the gas price to be in 2030? Is this reliable?
-44.78 + 0.02310(2030) = 2.113
Not reliable, 2030 is too far away from our data.
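The prediction and the extrapolation warning in question 11 can be sketched as follows. The function name `predict_gas_price` and the constants `FIRST_YEAR` and `LAST_YEAR` are illustrative; the year range 1976 to 2004 is taken from the description of the data in question 10 b).

```python
# Fitted line from the output above: Avg. Gas Price = -44.78 + 0.02310 * Year
INTERCEPT = -44.78
SLOPE = 0.02310

# Range of years covered by the data (from the plot description).
FIRST_YEAR, LAST_YEAR = 1976, 2004

def predict_gas_price(year):
    """Predicted average gas price for a given year, using the fitted line."""
    return INTERCEPT + SLOPE * year

def is_extrapolation(year):
    """True when the year lies outside the range of the observed data."""
    return not (FIRST_YEAR <= year <= LAST_YEAR)

prediction_2030 = predict_gas_price(2030)
print(round(prediction_2030, 3), is_extrapolation(2030))
```

A prediction flagged as extrapolation can still be computed, but the line gives no guarantee that the trend continues that far beyond the data.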
12. Eleven members of a golf team play two rounds a piece. A bystander wants to
predict the round 2 score based on the round 1 score. Their scores are as follows:
Golfer        1    2    3    4    5    6    7    8    9   10   11
Round 1 (x)  89   90   87   95   86   81  105   83   88   91   79
Round 2 (y)  94   85   89   89   81   76   89   87   91   88   80
x̄ = 88.55, sx = 7.13
ȳ = 86.27, sy = 5.31
a). The least squares regression line minimizes the sum of the
squared residuals.
OR
The least squares regression line minimizes the sum of the
squared distances of the points to the line.
b). Identify the explanatory and response variables.
Explanatory variable: Round 1
Response variable: Round 2
c). What is R² and what is its interpretation?
R² = (0.549)² x 100 = 30.1%
OR
1.) Square r: 0.549² = 0.301
2.) Convert it to a percent by moving the decimal point two places to the right: 30.1%
30.1% is the percent of variation of round 2 scores that can be explained by round 1
scores
d). What is the slope and what is its interpretation?
b = r(sy/sx) = 0.549(5.31/7.13) ≈ 0.41 (rounded down to 0.40 in the calculations that follow)
Slope: 0.41 is the average change in round 2 golf scores for a one point change in
round 1 golf scores
e). What is the y-intercept and what is its interpretation?
a = ȳ - b x̄ = 86.27 - 0.40(88.55) = 50.85
No round 1 scores near zero. Do not interpret.
f). What is the least squares regression equation?
ŷ = a + bx
LSR Line: ŷ = 50.85 + 0.40x
g). Find the residual for golfer 7 (round 1=105, round 2=89).
obs x = 105, obs y = 89
ŷ = 50.85 + 0.40(105) = 92.85
Residual = obs y – pred y = 89 – 92.85 = -3.85
h). When this point is removed, the value of r changes to 0.661. Is that point an
influential outlier?
Yes, because r changed a lot.
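The regression quantities in question 12 can also be computed from the raw scores without rounding at intermediate steps. This is a sketch with illustrative variable names; it uses the same formulas as the hand solution, b = r(sy/sx) and a = ȳ - b x̄. Carrying full precision gives roughly b = 0.41, a = 50.01, and a residual of -4.01 for golfer 7, slightly different from the hand calculation, which rounds the slope to 0.40 before computing the intercept.

```python
from statistics import mean, stdev

round1 = [89, 90, 87, 95, 86, 81, 105, 83, 88, 91, 79]  # x
round2 = [94, 85, 89, 89, 81, 76, 89, 87, 91, 88, 80]   # y

n = len(round1)
x_bar, y_bar = mean(round1), mean(round2)
s_x, s_y = stdev(round1), stdev(round2)

# Sample correlation: sum of products of deviations over (n - 1) * sx * sy.
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(round1, round2)) / (
    (n - 1) * s_x * s_y
)

b = r * s_y / s_x        # slope
a = y_bar - b * x_bar    # y-intercept

# Residual for golfer 7: observed y minus predicted y.
residual_7 = 89 - (a + b * 105)

print(round(r, 3), round(r ** 2, 3))
print(round(b, 2), round(a, 2), round(residual_7, 2))
```

The sign and rough size of the residual match the hand answer; only the second decimal place is sensitive to where the rounding happens.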
13) The following plot shows how much water (cubic kilometers) was released by the
Mississippi River in the years 1954-1980.
Fitted Line Plot
Water Release = - 14837 + 7.808 Year
S = 120.066, R-Sq = 21.7%, R-Sq(adj) = 18.6%
[Plot: Water Release (300 to 900) vs. Year (1955 to 1980) with the fitted line; figure not reproduced.]
a). What is the correlation between year and water release?
r = +√0.217 = 0.466
1) Change R² to a decimal: 21.7% = 0.217
2) Take the square root: √0.217 = 0.466
3) Determine the sign: the slope is positive, so r is positive
b). In 1973, a major flood occurred. That year, the river discharged 880 cubic kilometers
of water. Find the residual for this point.
observed x = 1973
observed y = 880
ŷ = -14837 + 7.808(1973) = 568.184
residual = obs y – pred y
= 880 – 568.184
= 311.816
If we remove the observation from 1973, our new plot is:
Fitted Line Plot
Water Release = - 12468 + 6.598 Year
S = 103.614, R-Sq = 21.3%, R-Sq(adj) = 18.0%
[Plot: Water Release (300 to 800) vs. Year (1955 to 1980) with the fitted line; figure not reproduced.]
c). How did the LSR equation change? R²?
The slope went down (7.808 to 6.598), the y-intercept went up (-14837 to -12468), and R² went down a little (21.7% to 21.3%).
d). Was 1973 an influential outlier?
No, the line didn’t change very much and R² only changed a little.
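Two steps from question 13 lend themselves to a short check: recovering r from R² with the sign of the slope, and computing the residual for 1973. This is a sketch using only the numbers printed in the first fitted line plot; the variable names are illustrative.

```python
import math

# Values read from the first fitted line plot (1973 included).
INTERCEPT = -14837
SLOPE = 7.808
R_SQUARED = 0.217  # 21.7% written as a decimal

# r is the square root of R-squared, carrying the sign of the slope.
r = math.copysign(math.sqrt(R_SQUARED), SLOPE)

# Residual for 1973: observed discharge minus the discharge predicted by the line.
observed_1973 = 880
predicted_1973 = INTERCEPT + SLOPE * 1973
residual_1973 = observed_1973 - predicted_1973

print(round(r, 3))
print(round(predicted_1973, 3), round(residual_1973, 3))
```

A large residual alone does not make a point influential; as part d) notes, the test is how much the line and R² change when the point is removed.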
14. For each of the descriptions below, determine what type of mistake is taking
place: extrapolation or misuse of cause and effect.
a). There is a high correlation between being a newspaper subscriber and having a high
income. Should I subscribe to the newspaper if I want to make more money?
Misuse of cause and effect, high correlation does not mean causation
b). 15 year-old Abby is an aspiring professional golfer. For the last 5 years, she has
recorded her average score at a local course. Using this information, she predicts what
her average score will be when she is 25.
Extrapolation, 25 is too far away from the data observed
c). In any given city, the number of churches and the number of bars are highly correlated.
Does church attendance cause drinking?
Misuse of cause and effect, high correlation does not mean causation. There could
be a lurking variable: a larger population causes both the
number of churches and the number of bars to increase.
15. A student who waits on tables at a Chinese restaurant in a college neighborhood
records the cost of meals and the tip left by single diners. The student wants to predict
the tip based on the price of the meal. r = 0.954
(x) Meal: $4.50 $5.79 $6.24 $4.62 $6.35    x̄ = 5.5, sx = 0.884
(y) Tip:  $0.50 $0.75 $0.85 $0.60 $1.00    ȳ = 0.74, sy = 0.1981
a) Compute the least-squares regression line for these data.
b = r(sy/sx) = 0.954(0.1981/0.884) = 0.2138
a = ȳ - b x̄ = 0.74 - 0.2138(5.5) = -0.4359
LSR Line: ŷ = -0.4359 + 0.2138x
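b) Make a scatterplot of the data and draw the regression line on your plot.
ŷ(4) = -0.4359 + 0.2138(4) = 0.4193
ŷ(6) = -0.4359 + 0.2138(6) = 0.8469
Now plot (4, .4193) and (6, .8469), and
draw a line through the points.
[Scatterplot: tip (0.25 to 1.00) on the y-axis, cost of meal in $ (4 to 6.5) on the x-axis; figure not reproduced.]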
c) The next diner orders a meal costing $4.89. Use your regression line to predict the
tip.
ŷ = -0.4359 + 0.2138(4.89) = 0.61
Predicted tip: 61 cents
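The line for question 15 can be rebuilt from the summary statistics given in the problem (r, the two means, and the two standard deviations). The sketch below rounds the slope and intercept to four decimal places, which is an assumption made so that the two plotted points (4, .4193) and (6, .8469) are reproduced; `predict_tip` is an illustrative name.

```python
# Summary statistics given in the problem.
r = 0.954
x_bar, s_x = 5.5, 0.884     # meal cost
y_bar, s_y = 0.74, 0.1981   # tip

# Slope and intercept, rounded to four decimals (assumed rounding).
b = round(r * s_y / s_x, 4)
a = round(y_bar - b * x_bar, 4)

def predict_tip(meal_cost):
    """Predicted tip in dollars for a meal of the given cost."""
    return a + b * meal_cost

print(b, a)
print(round(predict_tip(4), 4), round(predict_tip(6), 4))
print(round(predict_tip(4.89), 2))
```

Because $4.89 lies inside the range of observed meal costs ($4.50 to $6.35), this prediction is interpolation rather than extrapolation.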
16. Below are boxplots for the amount of calories for the different types of cereals based
on whether they are on the bottom, middle or top shelf. Shelf 1 is the bottom shelf, shelf
2 is the middle shelf and shelf 3 is the top shelf.
Boxplot of calories vs shelf
[Side-by-side boxplots: calories (50 to 175) on the y-axis, shelf (1, 2, 3) on the x-axis, with the median marked; figure not reproduced.]
1.) What is the median amount of calories for boxes on the top shelf?
about 110 calories
2.) Which shelf has the largest spread?
top
3.) What is the shape of the distribution of calories for boxes sold on the
top? (right, left, symmetric)
Symmetric (You can know that a boxplot is symmetric, but you cannot
determine from a boxplot that it is normal.)
17. Answer the following questions.
a.) What is the strongest value of correlation? 1 or -1
b.) What is the measure of center that is most influenced by outliers? Mean
c.) What is the measure of spread that is the most influenced by outliers? Range
d.) What percentage of the data is less than Q1? 25%
e.) What percentage of the data is less than Q3? 75%
f.) What is an influential outlier? This is a point outside of the trend
of most of the data such that, when the point is removed, the slope of the line
changes drastically and the value of R² changes substantially.
g.) What type of graph would you use to explore the relationship between two
quantitative variables? scatterplot
h.) What type of graph would you use to explore the relationship between a
quantitative variable and a categorical variable? boxplots
i.) What type of graph would you use to explore the relationship between two
categorical variables? Contingency table
18. Below are boxplots of foal weights divided by gender(M=Males and F=Females).
Answer the following questions about the plot.
Boxplot of Weight vs Gender
[Side-by-side boxplots: Weight (90 to 130) on the y-axis, Gender (F, M) on the x-axis; figure not reproduced.]
a.) What is the median of the male foals? about 115
b.) Approximate the IQR for the female foals. Q3 - Q1 = 128 - 95 = 33
c.) Compare the centers of the weights of the male and female foals. Use a complete
sentence.
The median weight for the female foals is slightly lower than the median weight of
the male.
d.) Compare the spread of the weights of the male and female foals. Use a complete
sentence.
The IQR and the range for the female foals are larger than for the male foals.
e.) Compare the shapes of the two distributions. Use a complete sentence.
The shape of the distribution for the female foals is fairly symmetric, but the shape
of the distribution for the male foals is slightly right skewed.