STAT301 Solutions 3
In this homework we consider how to design an experiment correctly (without bias), evaluating normal probabilities and applying the central limit theorem.
(1) A teacher asks her class "How many children are there in your family, including yourself?" The mean response for her class is 3 children per family. According to the U.S. Census Bureau, in 2012 the number of children per household was 1.86. Why is the teacher's sample biased towards higher outcomes?
The U.S. Census Bureau includes all households, both with and without children, whereas the teacher samples only from the subpopulation of households that have at least one child (her students). Her sample is therefore not representative of all households and is biased towards households with children. Since every sampled household has children, her mean will be higher than in a sample that mixes childless households with households that have children.
(2) What is wrong with the following randomization procedures? How would you improve
them?
(a) Twenty students are to be used to evaluate a new treatment. Ten men are assigned
to receive the treatment whereas ten women are assigned to be controls.
There is a problem of confounding: it will be unclear whether it was the treatment or the gender that influenced the outcome. To prevent this type of confounding, the treatment should be assigned randomly over both males and females.
(b) Ten subjects are assigned to two treatments, supposedly 5 subjects to each treatment. For each of the ten subjects a coin is tossed. If the coin comes up heads the
subject is assigned to the first treatment, if the coin comes up tails the subject is
assigned to the second treatment.
In theory this is a good idea; the only problem is that the number of heads in 10 tosses may not be exactly 5, so the two treatment groups may not be of equal size. To ensure an equal number in each treatment, the 10 names could be put in a hat, with the first 5 names drawn assigned to one treatment and the remaining five to the other.
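As a rough illustration of why the coin-toss scheme can fail, the following sketch computes the exact chance of a balanced 5/5 split and shows the "names in a hat" fix as a shuffle. The subject labels are made up for the example.

```python
import math
import random

# Exact chance that tossing a fair coin for each of 10 subjects
# gives exactly 5 heads (i.e. balanced groups of 5 and 5):
p_balanced = math.comb(10, 5) / 2**10
print(f"P(exactly 5 heads in 10 tosses) = {p_balanced:.4f}")  # about 0.246

# The "names in a hat" fix: shuffle the subjects and split in half,
# which guarantees 5 per treatment every time.
subjects = [f"subject_{i}" for i in range(1, 11)]  # hypothetical labels
random.shuffle(subjects)
treatment_1, treatment_2 = subjects[:5], subjects[5:]
print(len(treatment_1), len(treatment_2))  # always 5 and 5
```

So with the coin-toss scheme the groups are balanced only about a quarter of the time, while the shuffle-and-split scheme is still random but always balanced.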
(3) A patient is classified as having gestational diabetes if the glucose level is above 140 milligrams per deciliter one hour after ingesting a sugary drink. Lucy's measured sugar level varies according to a normal distribution with mean µ = 125mg/dl and standard deviation 10mg/dl.
Since her mean level is below 140mg/dl, she does not have gestational diabetes. However, in reality the mean level is unknown; all that is known are the readings taken from blood samples. Therefore, below we evaluate the chance of wrongly diagnosing gestational diabetes based on the samples taken.
(a) Suppose one single measurement is made (one blood sample). What is the probability that she will be misdiagnosed as having gestational diabetes (in other words, what is the chance that her measurement will be above 140mg/dl, given that a single measurement is normally distributed with mean µ = 125mg/dl and standard deviation 10mg/dl)?
The population mean is µ = 125 and the standard deviation is σ = 10. Since we are assuming normality of the blood samples, to calculate the probability we make the z-transform z = (x − µ)/σ = (140 − 125)/10 = 1.5. Looking up this number in the tables and subtracting it from one (since we want the probability the blood sample is greater than 140) gives P(Z > 1.5) = 0.0668. Thus there is a 6.68% chance she will be falsely diagnosed based on one sample.
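The same calculation can be sketched with the standard library's normal distribution, avoiding the tables:

```python
from statistics import NormalDist

# Single measurement: X ~ N(125, 10); misdiagnosis means X > 140.
glucose = NormalDist(mu=125, sigma=10)
p_misdiagnosis = 1 - glucose.cdf(140)
print(f"P(X > 140) = {p_misdiagnosis:.4f}")  # about 0.0668
```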
(b) Instead suppose that on three separate days measurements are made and the
average measurement is taken over these three days. What is the probability that
she will be misdiagnosed as having gestational diabetes (in other words what is the
chance that her average over these three measurements will be above 140mg/dl)?
Hint: What is the distribution of the sample mean based on three measurements
given that a single measurement is normally distributed with mean µ = 125mg/dl
and standard deviation 10mg/dl?
The population mean is µ = 125, the standard deviation is σ = 10, and the distribution of the blood samples is assumed to be normal. Therefore the distribution of the sample mean based on three blood samples will also be normal, centered about the true population mean µ = 125, with standard error 10/√3 = 5.77. Thus to calculate the probability that the sample mean is greater than 140 we make the z-transform
z = (x̄ − µ)/(σ/√n) = (140 − 125)/(10/√3) = 2.60.
Looking this up in the tables and subtracting from one gives P(Z > 2.60) = 0.0047. Thus there is a 0.47% chance she will be falsely diagnosed based on 3 samples.
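The three-measurement case only changes the standard deviation of the distribution we query, from σ to the standard error σ/√n:

```python
from statistics import NormalDist
import math

# Mean of n = 3 independent measurements: X̄ ~ N(125, 10/√3).
n = 3
xbar = NormalDist(mu=125, sigma=10 / math.sqrt(n))
p_misdiagnosis = 1 - xbar.cdf(140)
print(f"P(X̄ > 140) = {p_misdiagnosis:.4f}")  # about 0.0047
```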
(c) Compare your solutions from part (a) and part (b). What do you notice about the probability of false diagnosis as a larger sample is used?
The probability of false diagnosis decreases significantly when a larger sample is used. That is why we usually take multiple measurements on a patient when testing for serious diseases.
(4) Suppose the scores of high school ACT test have mean 19.2 and standard deviation 5.1.
As we discussed in class, ACT scores are only very approximately normally distributed.
(a) Using the normal distribution, what is the approximate probability that a single
randomly selected student will score 23 or higher?
The population mean is µ = 19.2 and the standard deviation is σ = 5.1. In order to calculate the probability we assume normality (even though this is not strictly true) and calculate the z-transform z = (x − µ)/σ = (23 − 19.2)/5.1 = 0.75. Thus P(Z > 0.75) = 0.2266. In other words the probability of a student scoring over 23 is approximately 22.66% (approximately because we assumed normality of the distribution of scores).
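A quick sketch of the same approximation (the exact value differs slightly from the table answer because the tables round z to 0.75):

```python
from statistics import NormalDist

# Approximate ACT scores as N(19.2, 5.1); what is P(score >= 23)?
act = NormalDist(mu=19.2, sigma=5.1)
p_single = 1 - act.cdf(23)
print(f"P(X >= 23) = {p_single:.4f}")  # about 0.23
```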
(b) A simple random sample of 25 students is taken. What is the mean and standard
deviation of the average score (sample mean x̄) of these 25 students?
The mean of the sample mean is the same as the population mean, µ = 19.2. The standard deviation of the sample mean is the standard error, σ/√n = 5.1/√25 = 1.02.
(c) Using the normal distribution, what is the approximate probability that the sample mean score of these 25 randomly selected students will be 23 or higher?
Like part (a), we make the z-transform z = (x̄ − µ)/(σ/√n) = (23 − 19.2)/1.02 = 3.73. Looking this up in the tables gives P(Z > 3.73) = 0.0001.
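As a sketch, the sample-mean calculation mirrors the single-student one, with σ replaced by the standard error:

```python
from statistics import NormalDist
import math

# Sample mean of n = 25 ACT scores: X̄ ~ N(19.2, 5.1/sqrt(25)) = N(19.2, 1.02).
n = 25
xbar = NormalDist(mu=19.2, sigma=5.1 / math.sqrt(n))
p_mean = 1 - xbar.cdf(23)
print(f"P(X̄ >= 23) = {p_mean:.4f}")  # about 0.0001
```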
(d) Which of your normal probability calculations, (a) or (c), will be the most accurate? Give a reason for your answer.
The central limit theorem tells us that the distribution of the sample mean becomes closer to normal as the sample size grows. Since we calculated both probabilities in (a) and (c) under the assumption of normality, the probability in part (c) will be the more accurate estimate.
(5) The weights of airline passengers, including carry-on luggage, vary from passenger to passenger, and airlines must take passenger weight into account in order to determine the weight distribution in their aircraft.
The weight of an airline passenger including carry-on luggage has mean 190 pounds and standard deviation 40 pounds (however, there is no real guarantee it is normally distributed). The FAA stipulates that a plane carrying 30 passengers cannot take a passenger load greater than 6000 pounds. Use this information to calculate the probability that 30 passengers (with carry-on luggage) will exceed this weight.
Hint: If the total weight cannot exceed 6000 pounds, what does this mean about the average weight of the 30 passengers? Use this information together with the central limit theorem to calculate the chance.
We recall that
average = (sum of weights)/30.
Since the sum of weights should not exceed 6000 pounds, this is the same as saying that in order for the plane to fly, the average weight should not exceed 6000/30 = 200 pounds. Therefore this question is equivalent to finding the chance that the sample mean is greater than 200 pounds. The mean weight is µ = 190 and the standard deviation is σ = 40, so the standard error is 40/√30. To calculate the probability we assume normality (which is quite reasonable since the sample size is 30 and the CLT says that for relatively large sample sizes the sample mean is close to normal) and make the z-transform
z = (x̄ − µ)/(σ/√n) = (200 − 190)/(40/√30) = 1.37.
Looking this up in the tables gives P(Z > 1.37) = 0.0853. Thus there is an 8.5% chance that the total weight will be greater than 6000 pounds.
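The steps above can be sketched directly, converting the 6000-pound load limit into a sample-mean threshold:

```python
from statistics import NormalDist
import math

# Total load > 6000 lb for 30 passengers is the same event as
# sample mean > 6000/30 = 200 lb. By the CLT, X̄ ~ N(190, 40/sqrt(30)).
n, mu, sigma = 30, 190, 40
xbar = NormalDist(mu=mu, sigma=sigma / math.sqrt(n))
p_overweight = 1 - xbar.cdf(6000 / n)
print(f"P(X̄ > 200) = {p_overweight:.4f}")  # about 0.085
```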
The previous questions looked at designing experiments and calculating risks (probabilities) using neat results such as the central limit theorem. In the following question we try to illustrate the central limit theorem in 'action' through a series of plots. The estimated proportions that are calculated (and presented in the spreadsheet) are in fact examples of sample means. No calculations are required for this question, just an eagle eye.
(6) I have a coin that I want to check is fair (equal chance of getting heads or tails, i.e. proportion of heads is 0.5). I toss the coin 10 times and record the number of heads. I get 6 heads out of 10 tosses; the estimate of the proportion of heads based on this sample is 0.6.
(a) For the question ‘is the coin fair’ which answer is correct?
(A) No, I did not get 0.5.
(B) I tossed the coin only 10 times, and even if the coin is fair there is a good chance of tossing 6 heads. Therefore with just 10 tosses, it is impossible to tell whether the coin is fair or not.
The correct answer is (B). With only 10 tosses of the coin, the number of heads can be any number from zero to ten, and with a 50/50 chance of getting heads a large range of values for the number of heads is highly likely. Similarly, if the coin is not fair (there isn't a 50% chance of getting a head), a wide range of values is also possible. Therefore, getting 6 heads out of 10 does not tell us whether the coin is fair or not.
(b) It seems that the proportion of heads I get over 10 tosses has a distribution. I want
to know what this distribution looks like. I ask 100 volunteers (my hand is tired)
to toss the coin, each volunteer tosses the coin 10 times and records the number of
heads (out of ten). I evaluate the proportion of heads for each volunteer (= number
of heads/10). The raw numbers and the proportions can be found in the first two
columns of tossing coin.dat next to this homework on my website.
(i) Make a relative frequency histogram (with the normal density superimposed over the top) for the proportion of heads out of 10 (second column in tossing coin.dat). Is the mean of this plot close to 0.5? From the plot, is it clear that the coin is fair?
See Figure 1
Despite the slight left skew in the plot (which may or may not be an artifact of the data), it is roughly centered about 0.5. This suggests that the coin may be fair.
Figure 1:
(ii) I repeat the above experiment but now I ask each volunteer to toss the coin 50 times, and I evaluate the proportion of heads for each volunteer (= number of heads/50). Both the counts and the proportions can be found in the third and fourth columns of tossing coin.dat on my website.
Make a relative frequency histogram (with the normal density superimposed over the top) for the proportion of heads out of 50 (fourth column in tossing coin.dat). From this plot do you think the coin is fair? Give a reason for your answer.
See Figure 2
Figure 2:
Most of the data is centered between 0.4 and 0.65, with a large fraction of the sample means falling in the range 0.5 to 0.55. This suggests that even if the coin were unfair, the chance of getting heads is not much different from 0.5.
(iii) Compare the means and standard deviations in parts (i) and (ii) (they are given in the plots with the normal distribution). What are the similarities and differences? Which plot is closer to normal? Give a reason for your answer (based on what we learnt in class).
Figure 1 is more spread out than Figure 2. Both figures are unimodal. The plot for the proportion based on 50 tosses looks slightly more normal (except for the large peak).
(iv) My friend claims that he has a biased coin. I ask my 100 volunteers to toss the coin 10 times each and I calculate the proportion of heads (the data is in the fifth and sixth columns of tossing coin.dat). Make a histogram of the proportions for the biased coin (with the normal density superimposed over the top).
Describe the plot; does it look close to normal?
See Figure 3
Figure 3:
The sample mean is 0.328. It is difficult to characterise the shape, but it does not look symmetric, which is one major feature of the normal distribution. Since a proportion less than 0.5 is far more likely than one greater than 0.5, this suggests the coin is not fair.
(v) Finally, I ask my 100 volunteers to toss the coin 50 times each and I calculate the proportion of heads (the data is in the seventh and eighth columns of tossing coin.dat). Make a histogram of the proportions for the biased coin (with the normal density superimposed over the top). Compare the means and standard deviations for the 10 tosses and 50 tosses (they are given in the plots with the normal distribution).
See Figure 4
The mean and standard deviation for the proportion based on ten tosses are 0.38 and 0.131, whereas the mean and standard deviation for the proportion based on 50 tosses are 0.35 and 0.067. We see that the means are about the same but the standard deviation has gone down as the number of tosses increased. This fits with our understanding that the amount of variability in an estimator decreases as the sample size increases (here the estimator is the proportion of heads and the sample sizes are 10 and 50 respectively).
Figure 4:
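The observed standard deviations can be sanity-checked against theory: for a coin with P(heads) = p, the standard error of the estimated proportion over n tosses is √(p(1 − p)/n). Taking p ≈ 0.35 (roughly the observed mean for the biased coin) as an assumed value:

```python
import math

# Theoretical standard error of an estimated proportion over n tosses,
# assuming the biased coin's true P(heads) is roughly 0.35.
p = 0.35
se_10 = math.sqrt(p * (1 - p) / 10)
se_50 = math.sqrt(p * (1 - p) / 50)
print(f"SE over 10 tosses = {se_10:.3f}")  # about 0.151
print(f"SE over 50 tosses = {se_50:.3f}")  # about 0.067
```

These are in reasonable agreement with the 0.131 and 0.067 read off the plots, and show the expected √5 shrinkage in going from 10 to 50 tosses.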
(vi) Which of the histograms, (iv) or (v), is closer to normal? Give a reason for your answer. Do you think that my friend's coin is biased?
The plot for the proportion over 10 tosses is not close to symmetric, whereas the plot for the proportion over 50 tosses is symmetric. This fits with our theory (the central limit theorem) that the sample mean (which is the estimated proportion in this case) becomes more and more normal as the sample size grows. Since both plots are centered well below 0.5, the friend's coin does appear to be biased.