156S05 sample final with solutions
1. The data below represents the ages of 30 randomly selected USC students:
20 18 26 21 24 38 19 24 27 17 23 19 20
19 20 48 22 25 22 25 18 21 18 18 22 42
17 27 21 23
a) Make a stem and leaf plot of this data
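(stem = tens digit, leaf = ones digit; each stem is split into leaves 0-4 and 5-9)
1 |
1 | 778888999
2 | 0001112223344
2 | 55677
3 |
3 | 8
4 | 2
4 | 8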
b) Use your calculator's mean and standard deviation functions to find the mean and
standard deviation.
Enter the data into a list. Use 1var to find x-bar and s.
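As a cross-check on the calculator, here is a minimal Python sketch that computes the same two statistics from the ages listed in problem 1. It assumes only the standard library; the variable names are illustrative.

```python
import statistics

# Ages of the 30 students, copied from problem 1.
ages = [20, 18, 26, 21, 24, 38, 19, 24, 27, 17, 23, 19, 20,
        19, 20, 48, 22, 25, 22, 25, 18, 21, 18, 18, 22, 42,
        17, 27, 21, 23]

x_bar = statistics.mean(ages)   # sample mean
s = statistics.stdev(ages)      # sample standard deviation (divides by n - 1)

print(f"n = {len(ages)}, x-bar = {x_bar:.3f}, s = {s:.3f}")
```

`statistics.stdev` uses the n - 1 divisor, which matches the s reported by a calculator's one-variable statistics.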
c) Give the five number summary (min Q1 median Q3 max) for this data (be sure to
compute Q3 and Q1 according to the method described in the text).
With 30 students the median is the average of the 15th and 16th sorted values: (21 + 22)/2 = 21.5
The quartiles are at position 8 in each half of the sorted data: Q1 = 19 and Q3 = 25
Five number summary: 17 19 21.5 25 48
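The positions used above can be checked with a short sketch. It assumes that "the method described in the text" means taking Q1 and Q3 as the medians of the lower and upper halves of the sorted data, which is what the solution's "position 8 in each half" suggests; the function name is hypothetical.

```python
import statistics

ages = [20, 18, 26, 21, 24, 38, 19, 24, 27, 17, 23, 19, 20,
        19, 20, 48, 22, 25, 22, 25, 18, 21, 18, 18, 22, 42,
        17, 27, 21, 23]

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max), with quartiles taken as
    medians of the lower and upper halves (assumed textbook method)."""
    data = sorted(values)
    half = len(data) // 2          # middle value is excluded when n is odd
    lower, upper = data[:half], data[-half:]
    return (data[0], statistics.median(lower), statistics.median(data),
            statistics.median(upper), data[-1])

summary = five_number_summary(ages)
print(summary)
```

Other quartile conventions (for example, those used by spreadsheet software) can give slightly different Q1 and Q3 values.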
d) Make a boxplot of this data
[Boxplot on an axis from 17 to 48: whiskers run from the minimum 17 to the maximum 48; the box spans Q1 = 19 to Q3 = 25 with the median line at 21.5.]
e) Which of the values computed above (mean or median) best represents the center
of the data set? Use the median, since the data is skewed.
f) Which values (standard deviation or the 5 number summary) could be used to best
describe the spread of the data? 5 number summary since skewed
g) How would you describe the shape of the data set? skewed right
2. Scores on the SAT follow a normal distribution with mean 500 and standard
deviation 100.
a) Use the 68-95-99.7 rule to give a range that includes 68% of all scores.
500±100 or (400,600)
b) What fraction of the scores are between 480 and 675?
z = (480-500)/100 = -.2 , p = .4207
and z = (675-500)/100 = 1.75 , p = .9599
Answer: .9599 - .4207 = .5392
c) How high must a score be to be higher than 75% of all scores?
From the z-table z is about .67 (from the t-table we can find z = .674). Now the
value is mu + z sigma = 500 + .674*100 = 567.4
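The table lookups in parts b) and c) can be cross-checked in Python. This is a sketch using `statistics.NormalDist` from the standard library (Python 3.8+); the variable names are illustrative.

```python
from statistics import NormalDist

# SAT scores modeled as Normal(mean 500, standard deviation 100).
sat = NormalDist(mu=500, sigma=100)

# b) fraction of scores between 480 and 675
frac_between = sat.cdf(675) - sat.cdf(480)

# c) score that is higher than 75% of all scores (75th percentile)
cutoff_75 = sat.inv_cdf(0.75)

print(f"P(480 < X < 675) = {frac_between:.4f}")
print(f"75th percentile  = {cutoff_75:.1f}")
```

Small differences from table answers come from rounding z to two decimal places.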
3. Consider the data below of 9 quiz scores, before and after a 20 minute review
session:
student  before  after
1        11      12
2        13      14
3        10      11
4        11      12
5        12      14
6         9      11
7         8      10
8        11      13
9        11      11

         before  after
N        9       9
MEAN     10.667  12.000
STDEV    1.500   1.414

r = .88389    r-sq = .78125
a) Make a scatterplot of this data
Done in class
b) Describe the relationship (shape, strength, direction)
linear, fairly strong (r = .88), positive
c) Find the equation of the least squares line $\hat{y} = a + bx$ which would predict after
scores from before scores using $b = r \, \frac{s_y}{s_x}$, $a = \bar{y} - b\bar{x}$ and the values provided
above. b = .88389*1.414/1.5 = .8332 a = 12 - .8332*10.667 = 3.11
equation: y-hat = 3.11 + .8332 x
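The slope and intercept can also be recomputed from the raw quiz scores rather than from the rounded summary values. The sketch below assumes nothing beyond the data in the table and the formulas b = r s_y/s_x and a = y-bar - b x-bar; variable names are illustrative.

```python
import math

before = [11, 13, 10, 11, 12, 9, 8, 11, 11]
after = [12, 14, 11, 12, 14, 11, 10, 13, 11]
n = len(before)

x_bar = sum(before) / n
y_bar = sum(after) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - x_bar) ** 2 for x in before)
syy = sum((y - y_bar) ** 2 for y in after)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(before, after))

r = sxy / math.sqrt(sxx * syy)        # correlation
s_x = math.sqrt(sxx / (n - 1))        # sample standard deviations
s_y = math.sqrt(syy / (n - 1))

b = r * s_y / s_x                     # slope
a = y_bar - b * x_bar                 # intercept

print(f"r = {r:.5f}, r-sq = {r**2:.5f}")
print(f"y-hat = {a:.3f} + {b:.4f} x")
```

Working from unrounded values gives coefficients that differ from the hand calculation only in the third or fourth decimal place.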
d) Plot this line on your graph. Indicate the coordinates of the points you chose to plot
the line.
You need to show me (x,y) pairs. Pick x values on each side of the graph, say
x = 8 and x = 13 and put into the equation to get the y's. ( 8 , 9.78) (13,13.94).
Draw the line through the two points.
e) What does the line predict as an after score if before is 15? 15.608
Would you be willing to accept this prediction? Why or why not?
I have some doubt about the predicted value. We have no data with x values as
large as 15, so this is extrapolation and we do not know whether the linear
pattern continues.
f) What percent of the variation in after scores is explained by the linear relationship
determined by the line? r-squared is .78125 so 78.125%
4. a) Define the terms:
lurking variable
sampling distribution of a statistic
confounding
simple random sample
Look these up. Any term in a box is fair game on the exam.
b) Explain what the difference is between outliers and influential observations in
regression. Outliers are far from the line. Influential observations have
extreme x values and when added or removed change the position of the
regression line. They pull the regression line towards them and may not be
far from the line.
5. It is suspected that first time freshmen and transfer students will react differently to
a new on-line format for Math 156. Outline a randomized block experiment that could
be performed on 40 freshmen and 40 transfer test subjects, to determine if the new
course is more effective at delivering the material than the old one.
Separate the students into freshmen and transfer student groups (the blocks).
Within each group, randomly assign half of the students to the on-line class
and half to the traditional class. Compare performance at the end of term
within each group.
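The random assignment within blocks can be sketched in a few lines of Python. The subject labels (F1..F40, T1..T40), the function name, and the fixed seed are hypothetical choices for illustration, not part of the problem.

```python
import random

def assign_within_blocks(blocks, seed=0):
    """Randomly split each block in half: on-line vs. traditional."""
    rng = random.Random(seed)              # fixed seed only for reproducibility
    assignment = {}
    for block_name, subjects in blocks.items():
        shuffled = subjects[:]             # copy so the input is not modified
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        for subject in shuffled[:half]:
            assignment[subject] = (block_name, "on-line")
        for subject in shuffled[half:]:
            assignment[subject] = (block_name, "traditional")
    return assignment

# Hypothetical subject labels for the 40 freshmen and 40 transfer students.
blocks = {
    "freshman": [f"F{i}" for i in range(1, 41)],
    "transfer": [f"T{i}" for i in range(1, 41)],
}
assignment = assign_within_blocks(blocks)

counts = {}
for block_name, treatment in assignment.values():
    counts[(block_name, treatment)] = counts.get((block_name, treatment), 0) + 1
print(counts)
```

Randomizing separately inside each block is what distinguishes this design from a completely randomized one.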
Explain what could go wrong if the researchers simply carried out a randomized
comparative experiment without blocking.
If one group has better results with the on-line class and the other with
traditional teaching, combining the groups could give the apparent result
that the two methods produce similar results.
6. Consider the probabilities listed below for the eye color of a randomly chosen USC
student:
color        blue  brown  green  other
probability  .19   .58    ??     .12
a) What must be the green probability? 1 - (.19 + .58 + .12) = .11
b) What is the chance a randomly chosen student doesn't have blue eyes? 1 - .19 = .81
c) Consider the experiment of randomly selecting two USC students. What is the
probability that both have blue eyes? (.19)(.19) = .0361
That neither has blue eyes? (1 - .19)(1 - .19) = .6561
7. Suppose 60% of USC students work more than 20 hours a week.
a) If you randomly sample 8 students and count the number that work more than 20
hours a week, what is the probability that this number is 5?
Use $P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$ with n = 8, k = 5, p = .6 to get .27869
What is the probability that this number is 5 or more?
Here you need to sum .27869 with the corresponding three probabilities
coming from k = 6 , 7 and 8 in the formula above.
b) Compute the mean and standard deviation of the number in your sample of 8
students that work more than 20 hours a week.
mean = np = 8*.6
standard deviation = sqrt( np(1-p) ) = sqrt( 8 * .6 * .4 )
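Parts a) and b) can be evaluated directly. The sketch below assumes the binomial model with n = 8 and p = .6 from the problem; `binom_pmf` is an illustrative helper name, and `math.comb` requires Python 3.8+.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial count with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 8, 0.6

p_exactly_5 = binom_pmf(5, n, p)
p_5_or_more = sum(binom_pmf(k, n, p) for k in range(5, n + 1))

mean = n * p
sd = math.sqrt(n * p * (1 - p))

print(f"P(X = 5)  = {p_exactly_5:.5f}")
print(f"P(X >= 5) = {p_5_or_more:.5f}")
print(f"mean = {mean:.1f}, sd = {sd:.4f}")
```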
c) What is the approximate probability that in a sample of 60 students, more than 40 work
more than 20 hours a week?
Use $z = \frac{x - np}{\sqrt{np(1 - p)}}$ with x = 40, n = 60, p = .6 to get z = 1.05 and prob = 1 - .8531 = .1469
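The normal approximation in part c) can be sketched as follows. Like the solution, this version uses no continuity correction; that choice is an assumption carried over from the worked answer.

```python
import math
from statistics import NormalDist

n, p, x = 60, 0.6, 40

mu = n * p                          # mean of the count
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of the count

z = (x - mu) / sigma
prob_more_than_40 = 1 - NormalDist().cdf(z)   # upper-tail area

print(f"z = {z:.3f}, P(X > 40) is approximately {prob_more_than_40:.4f}")
```

The result differs slightly from the table answer because the table rounds z to 1.05.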
8. Weights of bags of Fritos coming out of the factory follow a normal distribution with
mean 1.8 ounces and standard deviation .15 ounces.
a) What is the probability a randomly chosen bag weighs over 2 ounces?
z = (2 - 1.8)/.15 = 1.33 probability = 1 - .9082 = .0918
b) Compute the mean and standard deviation of the average weight of three randomly
chosen bags. mean = 1.8 standard deviation = .15/sqrt3 = .0866
c) What is the probability that the average weight of three randomly chosen bags is
above 2 ounces? z = (2 - 1.8)/ .0866 = 2.31 prob = 1 - .9896 = .0104
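The three parts of problem 8 can be checked with `statistics.NormalDist`. This is a sketch; the variable names are illustrative, and the sample mean is treated as normal with standard deviation sigma/sqrt(n), as in the solution.

```python
import math
from statistics import NormalDist

mu, sigma, n = 1.8, 0.15, 3

# a) one bag
one_bag = NormalDist(mu, sigma)
p_one_over_2 = 1 - one_bag.cdf(2)

# b) mean of three bags: same mean, standard deviation sigma / sqrt(n)
se = sigma / math.sqrt(n)

# c) probability the average of three bags exceeds 2 ounces
mean_of_three = NormalDist(mu, se)
p_mean_over_2 = 1 - mean_of_three.cdf(2)

print(f"P(one bag > 2) = {p_one_over_2:.4f}")
print(f"sd of mean     = {se:.4f}")
print(f"P(mean > 2)    = {p_mean_over_2:.4f}")
```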
9. Suppose a 95% confidence interval for the mean is quoted to be (43.5, 183.9). Are
sampled values likely to be in this interval? Why or why not? Probably not. The
interval explains where the population mean is, not individual values. Large
samples give small intervals which would contain a small percentage of data
values.
10. a) Compute a 95% confidence interval if your sample contains 345 items, the
sample mean is 8.3 and the population standard deviation is 6.34.
Note: we are given the population standard deviation so it is a z-interval.
8.3 ± 1.96 * 6.34/sqrt(345)
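The interval can be evaluated numerically. The sketch below uses the values stated in the question (n = 345, sample mean 8.3, population standard deviation 6.34) and obtains z* from `NormalDist` rather than a table.

```python
import math
from statistics import NormalDist

n, x_bar, sigma = 345, 8.3, 6.34

z_star = NormalDist().inv_cdf(0.975)      # about 1.96 for 95% confidence
margin = z_star * sigma / math.sqrt(n)    # margin of error
interval = (x_bar - margin, x_bar + margin)

print(f"z* = {z_star:.3f}, margin = {margin:.3f}")
print(f"95% CI: ({interval[0]:.3f}, {interval[1]:.3f})")
```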
b) What assumptions must be satisfied for the interval to be valid?
Since this is a large sample, need only SRS. For small samples we would need
a normally distributed population.
11. Suppose you wish a 95% confidence interval to estimate the mean to within .175 .
If the population standard deviation is 1.89, how large a sample must be taken?
n = (1.96*1.89/.175)^2 = 448.08 so n = 449
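The rounding-up step is easy to get wrong, so here is a small sketch of the calculation. The function name is hypothetical; it implements n = (z* sigma / m)^2, rounded up to the next whole number.

```python
import math

def sample_size_for_mean(z_star, sigma, margin):
    """Smallest n whose z-interval margin of error is at most `margin`."""
    return math.ceil((z_star * sigma / margin) ** 2)

n_needed = sample_size_for_mean(z_star=1.96, sigma=1.89, margin=0.175)
print(n_needed)
```

Always round up: rounding 448.08 down to 448 would give a margin slightly larger than .175.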
12. Suppose a hypothesis test results in a p-value of .002. What can you say about the
null and alternative hypotheses?
Reject Ho. Accept Ha. There is very strong evidence against Ho. Random
variation alone almost never could produce the observed results.
13. Perform a hypothesis test of $H_0: \mu = 2.5$ vs. $H_a: \mu \neq 2.5$ if $\bar{x} = 2.96$, n = 12,
and $\sigma = 1.2$. Quote a p-value in your conclusions.
Note: this is a two-sided test. The p-value will be double the tail probability.
Since the population standard deviation is given we use z as the test statistic.
If the sample standard deviation was given we would compute t .
z = (2.96 - 2.5)/(1.2/sqrt(12)) = 1.33, tail prob = 1 - .9082 = .0918, p-value = 2(.0918) = .1836.
Fail to reject Ho. No significant evidence in favor of Ha. Random variation can
easily produce data of the type observed.
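A sketch of the two-sided z test, using the values given in the problem. The variable names are illustrative.

```python
import math
from statistics import NormalDist

mu0, x_bar, n, sigma = 2.5, 2.96, 12, 1.2

z = (x_bar - mu0) / (sigma / math.sqrt(n))   # test statistic
tail = 1 - NormalDist().cdf(abs(z))          # one-tail probability
p_value = 2 * tail                           # two-sided test doubles the tail

print(f"z = {z:.3f}, p-value = {p_value:.4f}")
```

The p-value differs from the table answer in the third decimal place because the table rounds z to 1.33.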
b) What assumptions need to be satisfied for your results to be valid?
With small samples such as this one we need to know the population has a
normal distribution. As usual we also need to know the data was collected by
a SRS.
14. Construct a 95% confidence interval for the difference of the two population means
if the first sample has n=25, mean 1.5 and standard deviation .25 and the second
sample has n=22, mean 1.85 and standard deviation .35.
Using $\bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$
we get (1.5 - 1.85) ± t* sqrt( .25^2/25 + .35^2/22).
The t* is found using df = 21 (the smaller of n1 - 1 and n2 - 1) to be 2.080.
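The interval can be finished numerically. In this sketch the critical value t* = 2.080 is taken from the solution (t-table, df = 21) rather than computed, since the standard library has no t distribution.

```python
import math

n1, mean1, s1 = 25, 1.5, 0.25
n2, mean2, s2 = 22, 1.85, 0.35
t_star = 2.080   # from the t-table with df = 21, as quoted in the solution

se = math.sqrt(s1**2 / n1 + s2**2 / n2)   # standard error of the difference
margin = t_star * se
diff = mean1 - mean2
interval = (diff - margin, diff + margin)

print(f"difference = {diff:.2f}, margin = {margin:.4f}")
print(f"95% CI: ({interval[0]:.4f}, {interval[1]:.4f})")
```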
15. Recall the data from problem 3:
student  before  after
1        11      12
2        13      14
3        10      11
4        11      12
5        12      14
6         9      11
7         8      10
8        11      13
9        11      11
Perform a hypothesis test to determine if the review session was effective. You need
to first determine if this test should be a matched pairs procedure or an independent
sample procedure.
a) Explain your choice. Since the data comes from sampling each student twice
the samples are not independent. It is a matched pairs design.
We first find the differences: after - before. Positive values indicate the
review worked.
student  before  after  after-before
1        11      12     1
2        13      14     1
3        10      11     1
4        11      12     1
5        12      14     2
6         9      11     2
7         8      10     2
8        11      13     2
9        11      11     0
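Ho: mu = 0 (mu is the mean of the after - before differences)
Ha: mu > 0
The differences have mean 1.333 and standard deviation .7071, with n = 9.
Now compute $t = \frac{\bar{x}_d}{s_d / \sqrt{n}} = 5.66$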
b) State your results using a p-value. df = 8. p-value less than .0005 from tables.
Reject Ho. Accept Ha. Very strong evidence against Ho. Random variation is
almost surely not the cause of the observed data.
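The matched pairs statistic can be recomputed from the raw scores. This sketch stops at the t statistic; the p-value still comes from a t-table with df = 8, since the standard library has no t distribution.

```python
import math
import statistics

before = [11, 13, 10, 11, 12, 9, 8, 11, 11]
after = [12, 14, 11, 12, 14, 11, 10, 13, 11]

# Work with the paired differences, after - before.
diffs = [a - b for b, a in zip(before, after)]

d_bar = statistics.mean(diffs)    # mean difference
s_d = statistics.stdev(diffs)     # sample sd of the differences
n = len(diffs)

t = d_bar / (s_d / math.sqrt(n))  # one-sample t statistic on the differences
df = n - 1

print(f"d-bar = {d_bar:.4f}, s_d = {s_d:.4f}, t = {t:.3f}, df = {df}")
```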
c) What assumptions need to be satisfied for your results to be valid?
SRS of students, Normally distributed differences
16. A recent poll resulted in 184 of 350 students in favor of a student fee to support the
USC math learning center.
a) Produce a 95% confidence interval for the proportion of USC students as a whole
that are in favor of the fee. p-hat = 184/350 = .5257
interval: .5257 ± 1.96* sqrt( .5257*.4743 / 350 )
b) Perform the hypothesis test of $H_0: p = .5$ vs. $H_a: p > .5$. Report a p-value in
your conclusions.
The test statistic is $z = \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}} = .9621$. Use .5 for p-zero and n = 350.
p-value = 1 - .8315 = .1685. Fail to reject Ho. No evidence in favor of Ha.
Random variation can easily produce data of the type observed.
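Both parts of problem 16 can be checked together. The sketch uses z* = 1.96 for the interval, as in the solution, and the null value p0 = .5 in the standard error of the test.

```python
import math
from statistics import NormalDist

successes, n, p0 = 184, 350, 0.5
p_hat = successes / n

# a) 95% confidence interval: standard error uses p-hat
margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
interval = (p_hat - margin, p_hat + margin)

# b) one-sided test of p = .5 vs p > .5: standard error uses p0
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
p_value = 1 - NormalDist().cdf(z)

print(f"p-hat = {p_hat:.4f}, CI = ({interval[0]:.4f}, {interval[1]:.4f})")
print(f"z = {z:.4f}, p-value = {p_value:.4f}")
```

Note the different standard errors: the interval uses p-hat, while the test uses the hypothesized p0.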
c) What information would you need to know to determine if the interval and
hypothesis test are valid? SRS , population at least 10 times sample size
17. What sample size is required so that a 95% confidence interval for the population
proportion has a margin of error of .04:
a) If you think the proportion is about .8 ? (1.96/.04)^2 .8(1-.8) = 384.16 so n = 385
b) If you have no idea about the proportion? (using p=.5 gives the largest possible n)
(1.96/.04)^2 (.5)(.5) = 600.25 so n = 601
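A sketch of the sample size calculation for a proportion, using the guessed proportion .8 stated in part a) and the conservative value .5 from part b). The function name is hypothetical.

```python
import math

def sample_size_for_proportion(z_star, margin, p_guess):
    """Smallest n with margin of error at most `margin`, given a guess for p."""
    return math.ceil((z_star / margin) ** 2 * p_guess * (1 - p_guess))

n_guess_08 = sample_size_for_proportion(1.96, 0.04, 0.8)
n_no_idea = sample_size_for_proportion(1.96, 0.04, 0.5)

print(n_guess_08, n_no_idea)
```

Because p(1 - p) is largest at p = .5, the second value is the largest sample size any guess can require.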
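z-score: $z = \frac{x - \mu}{\sigma}$
Binomial: $\mu = np$ and $\sigma = \sqrt{np(1 - p)}$, $z = \frac{x - np}{\sqrt{np(1 - p)}}$
z-procedure: $\bar{x} \pm z^* \frac{\sigma}{\sqrt{n}}$, $z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}$, $n = \left( \frac{z^* \sigma}{m} \right)^2$
t-procedures: $\bar{x} \pm t^* \frac{s}{\sqrt{n}}$, $t = \frac{\bar{x} - \mu}{s / \sqrt{n}}$
Matched pairs: $\bar{x}_d \pm t^* \frac{s_d}{\sqrt{n}}$, $t = \frac{\bar{x}_d}{s_d / \sqrt{n}}$
Two Sample: $\bar{x}_1 - \bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$, $t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$
Proportion: $\hat{p} \pm z^* \sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$, $z = \frac{\hat{p} - p_0}{\sqrt{\frac{p_0 (1 - p_0)}{n}}}$, $n = \left( \frac{z^*}{m} \right)^2 p^* (1 - p^*)$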