Obligatory assignments, solution notes, week 6
1.
2.
3. > N = 10
> mu = 0.5
> sigma = 1
> x = rnorm(N,mu,sigma)
> t.test(x,conf.level=0.9)
One Sample t-test
data: x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
-0.008222728 1.151773396
sample estimates:
mean of x
0.5717753
> solution = uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.9), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> left = mean(x) - z*sd(x)/sqrt(N)
> right = mean(x) + z*sd(x)/sqrt(N)
> cat ("CI: <",left,",",right,">\n")
CI: < -0.008222728 , 1.151773 >
4. withinKnownSigma=0
N=10
Ntotal=200000
mu=1
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
set=rnorm(N,mu,sd)
x = mean(set)
if(mu > x - sKn & mu < x + sKn){withinKnownSigma=withinKnownSigma+1}
# Sigma known. Same CI each time, width = 2 sKn.
# Should give pnorm(1) - pnorm(-1) = 0.6826895
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
cat("Theoretical, normal distribution: ",pnorm(1) - pnorm(-1))
5. within=0
averCI=0
for (i in 1:Ntotal){
set=rnorm(N,mu,sd)
x = mean(set)
s = sd(set)/sqrt(N)
averCI = averCI + 2*s
if(mu > x - s & mu < x + s){within=within+1}
# Sigma unknown. Creating a new CI each time; the width 2s varies.
# Should be within with probability pt(1,N-1) - pt(-1,N-1)
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows student’s t distribution
cat("Theoretical, Student’s t distribution: ",pt(1,N-1) - pt(-1,N-1))
averCI = averCI/Ntotal
cat("\n\nKnown sigma CI-length:",2*sd/sqrt(N))
cat("\nAverage CI-length: ",averCI)
6. In the first case the value of σ is assumed known and the CI is constructed
using this value. In the second case the CI is constructed using the sample
standard deviation s.
In an experiment the value of σ would not be known and you would have
to use the second method based on s.
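A minimal sketch contrasting the two constructions on a single sample, using the built-in quantile functions qnorm and qt instead of the uniroot approach used elsewhere in these notes (the values mu = 1, sigma = 1 and the 95% level are illustrative):
N = 10
x = rnorm(N, mean = 1, sd = 1)
z = qnorm(0.975)          # known sigma: standard normal quantile
tq = qt(0.975, df = N-1)  # unknown sigma: student's t quantile
mean(x) + c(-1,1)*z*1/sqrt(N)        # CI constructed from the known sigma = 1
mean(x) + c(-1,1)*tq*sd(x)/sqrt(N)   # CI constructed from the sample sd s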
7. N=5
heading="Comparing normal and student’s t distribution"
plot.new()
Tsample=vector()
CLTsample=vector()
CLTscaledSample=vector()
Nr=50000
x = seq(-7,7,length=500)
sd=2
mu=1
for (i in 1:Nr){
set = rnorm(N,mu,sd)
xbar = mean(set)
CLTsample <- c(CLTsample,xbar)   # the mean -> normal dist for large N, according to CLT
CLTscaledSample <- c(CLTscaledSample,(xbar-mu)/(sd/sqrt(N)))  # scaled: standard normal distribution
t = (xbar-mu)/(sd(set)/sqrt(N))  # this variable t follows the student’s t distribution
Tsample <- c(Tsample,t)
}
br=c(min(Tsample),seq(-6.5,6.5,length=400),max(Tsample))
hCLT=hist(CLTscaledSample,breaks=br,freq=FALSE,main=heading,xlim=c(-4,4),col=rgb(1,0,0,1/4),border=rgb(1,0,0,1/4))
lines(x,dnorm(x,0,1),type="l",col="red",lwd=3)
legend(2,0.4,legend=paste("normal,sd = ",1),fill="red")
legend(-4,0.35,legend=paste("N = ",N))
hT=hist(Tsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,0,1,1/4),border=rgb(0,0,1,1/4),add=T)
lines(x,dt(x,N-1),type="l",col="blue",lwd=3)
legend(2,0.35,legend="student’s t, df=N-1",fill="blue")
hT=hist(CLTsample,breaks=br,freq=FALSE,xlim=c(-4,4),col=rgb(0,1,0,1/4),border=rgb(0,1,0,1/4),add=T)
lines(x,dnorm(x,mu,sd/sqrt(N)),type="l",col="green",lwd=3)
legend(2,0.3,legend=sprintf("normal,sd = %.2f",sd/sqrt(N)),fill="green")
8. t.test(x)
One Sample t-test
data: x
t = 2.3892, df = 99, p-value = 0.01878
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.03778428 0.40807200
mean of x
0.2229281
This result shows that one should reject the null hypothesis. A p-value
of 0.02 means that if the null hypothesis mean = 0 were true, such an
extreme result as mean = 0.22 would occur only 2 out of 100 times.
The 95 percent confidence interval < 0.038, 0.408 > means that there is a
95% probability that the true mean is within this interval. That is, if you
keep on drawing samples of 100 numbers from the same distribution, on
average the CI would include the true mean in 95 out of 100 cases.
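The repeated-sampling interpretation can be checked directly (a small sketch; the true mean 0.2 and the loop size are assumptions for illustration):
hits = 0
for (i in 1:10000){
  s = rnorm(100, mean = 0.2, sd = 1)
  ci = t.test(s)$conf.int
  if (ci[1] < 0.2 & 0.2 < ci[2]) hits = hits + 1
}
cat("fraction of CIs covering the true mean:", hits/10000, "\n")  # near 0.95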
t.test(y)
One Sample t-test
data: y
t = 0.054, df = 99, p-value = 0.957
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.1632961 0.1724413
mean of x
0.004572609
This result shows that the null hypothesis can’t and shouldn’t be rejected.
In fact, a mean of x this close to zero or closer would occur in only
4 out of 100 times if the null hypothesis were true. But this does not
mean that there is a probability of 96% that the true mean is zero! The
95 percent confidence interval < −0.163, 0.172 > means that there is a
95% probability that the true mean is within this interval. That is, if you
keep on drawing samples of 100 numbers from the same distribution, on
average the CI would include the true mean in 95 out of 100 cases. So
in this case the CI is far more valuable than the p-value.
9. The H0 hypothesis is that the true difference in means is zero, just as it
would be if the samples were drawn from the same distribution. H1 is
the opposite: the means are unequal.
> t.test(x,y)
Welch Two Sample t-test
data: x and y
t = 1.7336, df = 196.13, p-value = 0.08455
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.03003875 0.46674981
mean of x   mean of y
0.222928138 0.004572609
This result shows that the null hypothesis can’t be rejected, as the result
is not statistically significant. A p-value of 0.08 means that if the null
hypothesis of equal means were true, such an extreme result as a difference
in means of 0.218 would occur 8 out of 100 times. As this is not
too unlikely, the result is not significant. The CI tells us that there is a
95% probability that the true difference in means is within the interval
< −0.03, 0.47 >. Again, the CI gives us the most precise information.
However, quite often only the p-value is used in applied statistics.
If a 90 percent confidence interval is calculated the result is < 0.01, 0.42 >,
which excludes zero, so at the 90% level the true difference in means
appears to be nonzero (see the sketch below).
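The quoted 90% interval can be reproduced directly (a minimal sketch; x and y are the samples from assignment 8):
> t.test(x,y,conf.level=0.9)$conf.int   # approximately < 0.01, 0.42 >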
10. A Welch Two Sample t-test gives a p-value = 0.2517 and a 95 percent
confidence interval < −46, 145 >. The p-value is much larger than 0.05,
which means that this is a quite probable outcome if the null hypothesis
is true, so one cannot conclude that there is a difference in performance
between the two systems. From the CI we see that there is a 95%
probability that the true difference of the means is in this interval. The
interval includes zero, and again we cannot conclude that the systems are
different.
If we calculate a 74.83 percent confidence interval (1 − 0.2517 = 0.7483) we
get a < 0.004, 99.3 > CI (why is zero just at the edge of this CI?). This is
just another way to interpret the p-value: 74.83% of the times one makes
such a test, the mean difference would be in this range, not including zero.
A sketch of this calculation is given below.
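A minimal sketch of how the 74.83 percent interval is obtained (x and y stand for the two systems' measurements, which are not listed in these notes):
> t.test(x,y,conf.level=1-0.2517)$conf.int
# zero sits at the boundary because conf.level = 1 - p-value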
11. Now the Welch Two Sample t-test gives a p-value = 0.08744. If we decided
before doing the experiments to use a significance level of 0.05, we still
cannot conclude that there is a significant difference between the two
systems. The 95 percent confidence interval is < −8.4, 109 >, so the 95% CI
just includes zero. There is still not enough evidence to reject the null
hypothesis. If we compute a 91.256% CI we get the interval < 0.00, 100.5 >:
in more than 90% of the times one makes such a test, the mean difference
would be in this range, not including zero. (See the sketch below.)
If a significance level of 0.1 had been chosen before we did the experiment,
we could have rejected the null hypothesis.
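The same measurements reappear in assignment 17 (the p-value 0.08744 matches), so the quoted intervals can be reproduced with those x and y vectors (a sketch):
> t.test(x,y)$conf.int                       # 95% CI, just includes zero
> t.test(x,y,conf.level=0.91256)$conf.int    # boundary just touches zero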
12. Boxplot: see Figure 1.

Figure 1: Boxplot of lamp lifetimes

The bottom and top of the box are the 25th and 75th percentiles, and the
line near the middle of the box is the median. The default value of the
range argument is 1.5, so the whiskers are placed at the most extreme
points of the sample, but at most 1.5 times the interquartile range (= the
height of the box) outside it. The boxplot shows roughly how skewed the
distribution is and shows where the central half of the datapoints is
located. The standard deviation, on the other hand, is symmetric and will
show no skewness. Additionally, the boxplot shows the extrema, if
necessary as outliers. Since there are no outliers in this case, all points
are within the whiskers. Both the standard deviation and the mean might
change a lot with large outliers, while the median is less influenced by
such extreme points.
The following shows that both max and min are within 1.5 times the
interquartile range outside the box, so the whiskers are positioned at min
and max:
> max(x)
[1] 1340
> min(x)
[1] 702
> median(x)
[1] 1009
> quantile(x)
     0%     25%     50%     75%    100%
 702.00  924.50 1009.00 1154.75 1340.00
> IQR(x) # (Interquartile Range)
[1] 230.25
> 1154.75-924.50
[1] 230.25
> quantile(x,probs=0.75)
75%
1154.75
> quantile(x,probs=0.75) + 1.5*IQR(x)
1500.125
> quantile(x,probs=0.25) - 1.5*IQR(x)
579.125
This is because 702 is larger than 579 and 1340 is smaller than 1500.
The boxplot shows that the distribution is somewhat skewed: the lower 25%
quantile is much closer to the median than the upper 75% quantile is.
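For reference, a minimal sketch of how Figure 1 can be produced (range = 1.5 is the default discussed above):
> boxplot(x, range = 1.5, main = "Lamp lifetimes")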
13. This extreme point is now outside the box by more than 1.5 times the
interquartile range:
> quantile(x,probs=0.75) + 1.5*IQR(x)
1495.75
Since the extreme value 1896 is larger than the 75% quantile + 1.5 times
the interquartile range = 1495.75, it is plotted as an outlier. The upper
whisker is then placed at the next largest point, 1340, since this is below
1495.75. Comparing the median and mean, we see that the mean and the
standard deviation change a lot more than the median and the quantiles
because of this single error.
measure        correct    with typo
mean           1027.08    1047.08
sd             148.4      191.5
median         1009       1015.5
25% quantile   924.5      930.75
75% quantile   1154.75    1156.75
The boxplot is almost unchanged (if you zoom in on it) except that it
shows the single outlier. It is thus a method which is more robust to
individual erroneous points and gives more information than just the mean
and the standard deviation.
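Since the mean shifts by 20 with N = 50 datapoints, the typo must add 1000 to a single value, i.e. 896 was presumably entered as 1896. A sketch under that assumption:
> xt = replace(x, x == 896, 1896)  # hypothetical: 896 mistyped as 1896
> c(mean(xt), sd(xt), median(xt))
> quantile(xt, probs = c(0.25, 0.75))
> boxplot(xt)  # 1896 now shows up as the single outlier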
14. The 95 percent confidence interval is < 985, 1069 >, but this is a measure
of how accurate the calculated sample mean is: there is a 95% probability
that the true mean is within this range. It says nothing about the
probability of individual datapoints, so that probability is NOT 95%!

Figure 2: Boxplot of lamp lifetimes, with the outlier plotted as a separate point
The probability of a lifetime within this interval can only be estimated,
and a reasonable estimate can be found by counting the number of datapoints
within this interval in the current sample. Running sort(x) one finds that
only
1009 1009 1022 1035 1037 1045 1067
are within this range, or 7 out of 50, and thus 14%. This is of course just
a rough estimate of the probability, but it is certainly far from 95%.
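The count can also be done in one line (a sketch, not in the original notes):
> sum(x > 985 & x < 1069) / length(x)
[1] 0.14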
If we assume that the distribution is Gaussian with a mean of 1027 and
a standard deviation of 148, the probability is
> 1 - 2*pnorm(985,mean=1027.08,sd=148.4)
[1] 0.2232508
which gives a somewhat larger probability, 22%. However, there is no strong
evidence for a Gaussian distribution since the data are so skewed.
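Equivalently, without relying on the interval being symmetric about the mean (a small check, not in the original notes):
> pnorm(1069,mean=1027.08,sd=148.4) - pnorm(985,mean=1027.08,sd=148.4)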
15. x <- rnorm(10,1,1)
t.test(x)
One Sample t-test, p-value = 0.02408
2*(pnorm(-mean(x),0,sd(x)/sqrt(10))) = 0.006771427
# Much smaller: n is just 10 and sd(x) = 1.19 is far from 1
SIGN.test(x)
One-sample Sign-Test, p-value = 0.1094
> x
[1]  2.15935869 -0.36024036  2.36360647  0.00839999  1.18033842  2.10846273 -1.07201314
[8]  1.52291449  0.48742194  1.79825463
pbinom(2,10,0.5)*2 = 0.109375
# This is the exact answer. There were 2 values with a minus sign.
# p is the probability for 0,1,2 or 8,9,10 minus signs.
16. > x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.0007673417
> SIGN.test(x)
s = 10, p-value = 0.001953
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.01786454
> SIGN.test(x)
s = 8, p-value = 0.1094
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.02395486
> SIGN.test(x)
s = 8, p-value = 0.1094
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.007523077
> SIGN.test(x)
s = 9, p-value = 0.02148
> x <- rnorm(10,1,1)
> t.test(x)$p.value
[1] 0.03139396
> SIGN.test(x)
s = 8, p-value = 0.1094
Each time the t.test p-value is smaller. This is as expected: it is a more
powerful test and potentially gives more precise results. However, it
depends on the data being normally distributed when n is so small.
Otherwise the estimate of the p-value might be wrong. The sign-test has no
such dependency.
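A sketch comparing the two tests' rejection rates at the 5% level (SIGN.test comes from the BSDA package; the loop size is illustrative):
library(BSDA)
rejT = 0
rejS = 0
for (i in 1:1000){
  x <- rnorm(10,1,1)
  if (t.test(x)$p.value < 0.05) rejT = rejT + 1
  if (SIGN.test(x)$p.value < 0.05) rejS = rejS + 1
}
cat("t-test rejects:",rejT/1000," sign-test rejects:",rejS/1000,"\n")
# The t-test should reject more often, reflecting its greater power here.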
17. > y <- c(894,963,1098,982,1046,1002,989,994)
> x <- c(1011,998,1113,1008,1100,1039,1003,1098)
> t.test(x,y)
t = 1.8423, df = 13.537, p-value = 0.08744
> SIGN.test(x,y)
Dependent-samples Sign-Test
S = 8, p-value = 0.007812
In this case the SIGN-test gives a smaller p-value. We see that by chance,
that is, by the way they are paired, all the values of the x set are larger
than those of the y set, giving a small p-value for the sign-test. But note
that if one of the arrays is shuffled, you might get another result:
> w = sample(y)
> SIGN.test(x,w)
S = 6, p-value = 0.2891
> x
[1] 1011 998 1113 1008 1100 1039 1003 1098
> w
[1] 989 963 982 1046 894 1098 994 1002
The t.test is of course independent of such a permutation of the sample.
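Note that a paired t-test, unlike the unpaired Welch test used above, does depend on the pairing (a small illustration, not part of the original solution):
> t.test(x,y,paired=TRUE)$p.value   # uses the original pairing
> t.test(x,w,paired=TRUE)$p.value   # changes when y is shuffled into w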
18. > N = 10
> mu = 0.5
> sigma = 1
> x = rnorm(N,mu,sigma)
> t.test(x,conf.level=0.9)
One Sample t-test
data: x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
-0.008222728 1.151773396
sample estimates:
mean of x
0.5717753
> t = mean(x)/(sd(x)/sqrt(N))
> df = N-1
> solution = uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.9), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> p = 2*pt(-t,N-1)
> left = mean(x) - z*sd(x)/sqrt(N)
> right = mean(x) + z*sd(x)/sqrt(N)
> cat ("t =",t,"df =",df,"p-value =",p,"\n")
t = 1.807125 df = 9 p-value = 0.1042085
> cat ("CI: <",left,",",right,">\n")
CI: < -0.008222728 , 1.151773 >
> cat ("mean: ",mean(x),"\n")
mean: 0.5717753
19. > # Assuming normal distribution
> solution = uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.90), lower = 0, upper = 4, tol = 0.00001)
> z = solution$root
> p = 2*pnorm(-t)
> left = mean(x) - z*1/sqrt(N)
> right = mean(x) + z*1/sqrt(N)
> #left = mean(x) - z*sd(x)/sqrt(N) # Now we know that sd = 1
> #right = mean(x) + z*sd(x)/sqrt(N)
> t.test(x,conf.level=0.9)
One Sample t-test
data: x
t = 1.8071, df = 9, p-value = 0.1042
alternative hypothesis: true mean is not equal to 0
90 percent confidence interval:
-0.008222728 1.151773396
sample estimates:
mean of x
0.5717753
> cat ("t =",t,"df =",df,"p-value =",p,"\n")
t = 1.807125 df = 9 p-value = 0.07074286
> cat ("CI: <",left,",",right,">\n")
CI: < 0.05162674 , 1.091924 >
> cat ("mean: ",mean(x),"\n")
mean: 0.5717753
Since we know that σ = 1, the numbers calculated based on the normal
distribution follow from Theorem 2, and hence these numbers are exact.
If we draw numbers over and over from such a distribution, the true mean,
µ = 0.5, would on average be within this interval 90% of the time.
However, the 90% CI of the t-test is also exact, in the sense that if this
procedure were followed over and over again, the true mean would be within
the CI 90% of the time. But since that CI also depends on the value of the
sample standard deviation, its size would be different every time and could
be really large or really small.
20. If we did not know σ, the CI based on the student-t distribution would be
the correct one, because the calculation of the CI would have to depend on
the sample standard deviation, which also varies. So if such a calculation
were performed over and over again, 90% of the time the true mean would
be within the CI calculated each time. The size of the CI would vary for
each new measurement, but on average the true mean, µ = 0.5, would be
within this interval 90% of the time.
21. z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
withinKnownSigma=0
N=5
Ntotal=200000
mu=0
sd=1
sKn = sd/sqrt(N)
for (i in 1:Ntotal){
set=rnorm(N,mu,sd)
x = mean(set)
if(mu > x - z*sKn & mu < x + z*sKn){withinKnownSigma=withinKnownSigma+1}
# Sigma known. Same CI each time, width = 2 z sKn.
# Should give pnorm(z) - pnorm(-z) = 0.95
}
pG=withinKnownSigma/Ntotal
cat("Experiment, known sigma, within CI:",pG,"\n")
cat("Theoretical, normal distribution: ",pnorm(z) - pnorm(-z))
22. z=uniroot(function(z) (pt(z,N-1) - pt(-z,N-1) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
within=0
averCI=0
for (i in 1:Ntotal){
set=rnorm(N,mu,sd)
x = mean(set)
s = sd(set)/sqrt(N)
averCI = averCI + 2*s*z
if(mu > x - z*s & mu < x + z*s){within=within+1}
# Sigma unknown. Creating a new CI each time; the width 2*z*s varies.
# Should be within with probability pt(z,N-1) - pt(-z,N-1) = 0.95
}
p=within/Ntotal
cat("\n\nExperiment, unknown sigma, within CI: ",p,"\n")
# The random variable (x-mu)/s follows student’s t distribution
cat("Theoretical, Student’s t distribution: ",pt(z,N-1) - pt(-z,N-1))
averCI = averCI/Ntotal
z=uniroot(function(z) (pnorm(z) - pnorm(-z) - 0.95), lower = 0, upper = 4,tol = 0.00001)$root
cat("\n\nKnown sigma CI-length:",2*sd*z/sqrt(N))
cat("\nAverage CI-length: ",averCI)
23. We use the t-test of assignment 8 on the set x of 100 numbers as an
example:
t.test(x)
One Sample t-test
data: x
t = 2.3892, df = 99, p-value = 0.01878
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
0.03778428 0.40807200
mean of x
0.2229281
The following R-script draws, ten million times, a random sample of 100
numbers from a normal distribution with mean value zero and standard
deviation equal to 1, and counts how often the sample mean is at least as
extreme as 0.2229281:
N = 0
total = 10000000
for (i in 1:total){
x <- rnorm(100,sd=1.0)
m = mean(x)
if(m > 0.2229281 | m < -0.2229281){
N = N + 1
}
}
P = N/(1.0*total)
cat ("P: ",P,"\n")
When run it gives P: 0.0258132, which is a bit larger than the result from
the t-test, p-value = 0.019. Our result in this experiment is correct, since
it actually carries out the experiment that the p-value asked for in the
assignment corresponds to:
    How probable is it that the sample x is from a normal distribution
    with mean value zero and standard deviation 1, and that the mean
    value just by chance deviates that much from zero?
However, the t-test result is also correct for the case where we do not
know the value of the standard deviation σ, as explained below.
In the present case we know that the numbers are drawn from a normal
distribution, and the p-value equals the area outside the extreme values
-0.2229281 and 0.2229281 of the Gaussian function with mean zero and
standard deviation 1/√N = 0.1, as Theorem 2 states. And indeed, using
R gives the following result:
> 2*pnorm(-0.2229281,sd=0.1)
[1] 0.02579521
consistent with the result of the experiment.
However, is this p-value really what we would want to calculate if we
wanted to test a hypothesis based on the results of the sample x? No,
because then we would not know that the true standard deviation of the
distribution the sample x was drawn from is σ = 1. If we knew, the
p-value calculated above would be the one we wanted. But if we do not
know the value of σ, we have to rely on the sample standard deviation s
calculated from the N = 100 sample measurements. So we want to find the
p-value which answers:
    How probable is it that the sample x is from a normal distribution
    with mean value zero, and that the mean value just by chance
    deviates that much from zero?
The result given by the t-test is based on the single sample of N = 100
measurements only, and on no knowledge of the true value of the standard
deviation σ = 1. The student’s t distribution is the distribution of the
variable

    t = (x̄ − µ)/(s/√N)

Assuming µ = 0, and calculating sd(x) = 0.9330827, the recorded result
is t = 0.2229281/0.09330827 = 2.3891569 ≈ 2.4. Based on this single
sample average and sample standard deviation, the p-value is calculated
as
> 2*pt(-t,99)
[1] 0.01878143
The p-value calculated is the probability of obtaining a result as extreme
as t, or more extreme, given that the underlying distribution is normal but
σ is unknown. The way to perform an experiment which corresponds to this
p-value is to repeat the experiment over and over again and record how
often the new result

    t_new = (x̄ − µ)/(s/√N)

is larger in absolute value than the value we started out to compare with,
t = 2.4. If this experiment is carried out as follows using R:
N = 0
total = 10000000
t = 0.2229281/0.09330827
for (i in 1:total){
x <- rnorm(100,sd=1)
m = mean(x)/(sd(x)/10)
if(m > t | m < -t){
N = N + 1
}
}
P = N/(1.0*total)
cat ("P: ",P,"\n")
the result is P: 0.0187898, and indeed it fits the result of the t-test.