Hypothesis Tests
Arthur White
10th February, 2016
The aim of this lab is to learn how to implement some of the methods for
hypothesis testing that we have covered so far this semester. The underlying
theory behind these methods, as well as some simple data handling routines in
R, will also be discussed. Try to make sure that you understand the purpose of
each line of code.
Probability Distributions
Recall that in the lab for ST1251 last semester, we used the functions dhyper,
dbinom, and dpois. These give the probability functions, in this case P(X = x),
for the hypergeometric, binomial and Poisson distributions respectively. The d
prefix is one of four primary function prefixes available in R for these and
other probability distributions: each function name consists of a one-letter
prefix followed by an abbreviated name for the distribution.
See below for a table explaining the meaning of the different prefixes, and
below that, a table of some of the different probability distributions which
are available. In this lab we will concern ourselves with the normal and
Student t distributions.
Prefix   Meaning       Interpretation
d        density       f(x) (= P(X = x), when X is discrete)
p        probability   P(X ≤ x)
q        quantile      k such that P(X ≤ k) = p
r        random        x ∼ P(·)
Suffix   Probability Distribution
binom    Binomial
pois     Poisson
hyper    Hypergeometric
norm     Normal
exp      Exponential
t        Student’s t
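To see how a prefix and a suffix combine in practice, here is a small
illustrative sketch using the standard normal distribution (the values given
in the comments are approximate):

> dnorm(0)        # density f(0), approximately 0.3989
> pnorm(1.96)     # P(X <= 1.96), approximately 0.975
> qnorm(0.975)    # k such that P(X <= k) = 0.975, approximately 1.96
> rnorm(3)        # three random draws from N(0, 1)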
The Normal Distribution
Firstly, we will review some theory regarding the normal distribution. We use
dnorm and dt to compare the behaviour of the standard normal distribution (i.e.,
where µ = 0 and σ = 1) to Student’s t distribution. What happens as the
degrees of freedom (df) of the Student’s t distribution increase?
> x <- seq(-4, 4, length.out = 1000)
> y1 <- dnorm(x, mean = 0, sd = 1)
> y2 <- dt(x, df = 5)
> y3 <- dt(x, df = 10)
> y4 <- dt(x, df = 50)
> plot(x, y1, ylim = range(y1, y2, y3, y4),
+      type = "l", ylab = "f(x)", xlab = "x",
+      main = "Comparison of Normal to Student t Distribution", lwd = 3)
> lines(x, y2, col = "blue", lwd = 2)
> lines(x, y3, col = "red", lty = 2, lwd = 2)
> lines(x, y4, col = "cyan", lty = 2, lwd = 2)
Figure 1: Comparison of the standard normal distribution to the Student
t-distribution with progressively larger degrees of freedom.
To calculate probabilities, we use the function pnorm. This calculates the
area under the curve to the left of a value x, much like the reference tables used
in class. To calculate quantiles of X which correspond to particular percentiles,
the function qnorm may be used.
For example, suppose that the length X of machinery parts produced in
a factory is believed to be normally distributed with mean µ = 20mm and
standard deviation σ = 2.3mm. Suppose we wish to estimate the proportion of
parts that will measure between 15mm and 25mm in length. Since this interval
is symmetric about the mean, we can begin by calculating P(X < 15):
> pnorm(15, mean = 20, sd = 2.3)
[1] 0.01485583
By symmetry, P(X > 25) = P(X < 15), so the proportion of parts measuring
between 15mm and 25mm is 1 − 2 × 0.01485583 ≈ 0.9703.
To determine the measurements a and b such that the most central 95% of parts
will lie between these lengths, we can use the quantile function:
> qnorm( c(0.025, 0.975), 20, 2.3 )
[1] 15.49208 24.50792
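Since pnorm accepts a vector of quantiles, the proportion of parts between
15mm and 25mm can also be computed in one line by differencing the two tail
probabilities; a quick sketch:

> diff(pnorm(c(15, 25), mean = 20, sd = 2.3))

This agrees with the value of approximately 0.9703 obtained by symmetry above.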
Exercise 1
Use pt to calculate the p-value of a two-sided test with ν = 15 degrees of freedom
and t = −3.2. Use qt to obtain the corresponding critical values for the test,
with α = 0.01.
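If you have not used pt or qt before, the general calling patterns are
sketched below; here t0, nu and alpha are placeholder names standing in for an
observed test statistic, the degrees of freedom and the significance level
(they are not part of base R):

> 2 * pt(-abs(t0), df = nu)              # two-sided p-value
> qt(c(alpha/2, 1 - alpha/2), df = nu)   # two-sided critical values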
Assessing Normality
Now suppose that the part of length X connects to a second part, of length
Y, where Y follows a normal distribution with mean µ = 25mm and standard
deviation σ = 3.7mm. Let S = X + Y. What is the distribution of S?
We can use rnorm to simulate values of X and Y, and use these to examine
the relevant properties of S. For example, to estimate the mean and standard
deviation of S, we could use the code below. As the number of simulations
increases, the approximate answers should become more and more accurate.
> x <- rnorm(100, mean = 20, sd = 2.3)
> y <- rnorm(100, 25, 3.7)
> s <- x + y
> mean(s)
[1] 44.64785
> sd(s)
[1] 4.254638
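As a check on the simulation: the sum of independent normal random variables
is itself normally distributed, so in theory S has mean 20 + 25 = 45 and
standard deviation √(2.3² + 3.7²) ≈ 4.36. A quick sketch of the values the
simulation should approach:

> 20 + 25               # true mean of S
> sqrt(2.3^2 + 3.7^2)   # true sd of S, approximately 4.3566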
To assess the normality of data, we can use quantile-quantile, or QQ plots.
The functions qqnorm and qqline can be used to perform this task. For example,
to assess whether the simulated data s follows a normal distribution, use:
> qqnorm(s, main ="Q-Q Normal Plot of S")
> qqline(s)
This plot is shown in Figure 2.
Figure 2: QQ plot for the statistic S = X + Y. Does S appear to be normally
distributed?
Exercise 2
Consider other transformations of X and Y, e.g., W = √(X/Y). Do these appear
to be normally distributed?
Independent two sample t-test: Diabetes dataset
We will now implement a t-test in R. We analyse the systolic blood pressure
readings of 21 diabetic and 21 non-diabetic males. The dataset consists of a
21 × 2 matrix, with the first column containing the blood pressure readings
of the diabetic men, and the second column the readings of the non-diabetic
men. Although the data is simulated, it is based on a real analysis. The
dataset is available to download at
https://www.scss.tcd.ie/~arwhite/Teaching/ST1252/diabetes_sim.csv.
To read the data into R, first download the data to a folder or directory of
your preference. (Saving it to your desktop folder is convenient, but only as a
temporary solution!)
The data can be read into RStudio by going to Tools → Import Dataset →
From Text File, and then clicking on the location of the file. Alternatively, if
you know the file location of the data, then the data can be read in directly
using the read.csv() command.
> diabetes <- read.csv(file = "~/Desktop/diabetes_sim.csv")
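It is worth checking that the data has been read in correctly; a quick sketch:

> dim(diabetes)    # should report 21 rows and 2 columns
> head(diabetes)   # displays the first six rows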
Visually, we can compare the differences between the blood pressure rates
using boxplots or stripcharts. These are displayed in Figures 3 and 4. Which
group appears to have lower blood pressure? How pronounced do you think the
difference is?
> boxplot(diabetes[,1], diabetes[, 2], names = c("Diabetic", "Non-diabetic"))
Figure 3: Boxplot of the diabetes dataset.
It is straightforward to explicitly calculate the test statistic for an independent two sample t-test in R, using a pooled variance estimate:
> xbar <- mean(diabetes[, 1])
> ybar <- mean(diabetes[, 2])
> sx <- sd(diabetes[, 1])
> sy <- sd(diabetes[, 2])
> s.pool <- sqrt( (sx^2 + sy^2)/2 )   # pooled sd (equal group sizes)
> n <- nrow(diabetes)
> test.stat <- ( xbar - ybar )/( sqrt( 2 * s.pool^2 / n ) )
> test.stat
[1] 2.270401
> t.test(x = diabetes[, 1], y = diabetes[, 2], var.equal = TRUE)
Two Sample t-test
data: diabetes[, 1] and diabetes[, 2]
t = 2.2704, df = 40, p-value = 0.02864
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.299436 22.366364
sample estimates:
mean of x mean of y
152.7791 140.9462
> stripchart(diabetes, method = "jitter",
+     vertical = TRUE, group.names = c("Diabetic", "Non-diabetic"))
Figure 4: Stripchart of diabetes dataset.
The test statistic in the function’s output should correspond exactly to our
hand-calculated test.stat.
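Note also that t.test() returns its results as a list, whose components can be
extracted directly; this may be handy for the next exercise. A small sketch
(tt is just an illustrative name; statistic, parameter, p.value and conf.int
are the standard component names of the object returned by t.test):

> tt <- t.test(x = diabetes[, 1], y = diabetes[, 2], var.equal = TRUE)
> tt$statistic   # the test statistic
> tt$parameter   # the degrees of freedom
> tt$p.value     # the p-value
> tt$conf.int    # the 95 percent confidence interval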
Exercise 3
Identify the test statistic, degrees of freedom, and p-value from the output of
t.test() for the pooled variance test. Use pt to check whether the p-value in
this output corresponds to what you would obtain based on your test statistic.
What is the 95% confidence interval for the difference in mean blood pressure
between diabetic and non-diabetic men? Interpret this with respect to the
result of the test.
Model validation
We have specified the optional argument that var.equal = TRUE, i.e., that we
have assumed equal variances. We can test this assumption directly, as well as
performing a t-test on the data with the equal variances assumption relaxed.
(This is in fact the default configuration of the test.)
> var.test(diabetes[, 1], diabetes[, 2])
F test to compare two variances
data: diabetes[, 1] and diabetes[, 2]
F = 1.3934, num df = 20, denom df = 20, p-value = 0.4648
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.5653781 3.4339273
sample estimates:
ratio of variances
1.393365
> t.test(diabetes[, 1], diabetes[, 2], var.equal=FALSE)
Welch Two Sample t-test
data: diabetes[, 1] and diabetes[, 2]
t = 2.2704, df = 38.948, p-value = 0.02879
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
1.290565 22.375234
sample estimates:
mean of x mean of y
152.7791 140.9462
Here we have little evidence to suggest that our equal variance assumption is
unreasonable. Note that in this case the output from the latter t-test (formally,
Welch’s test) is much the same as for the pooled variance test.
Exercise 4
Check the normality assumptions underlying the test, visually, using qqnorm
and qqline, and formally, with shapiro.test.
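As a hint, shapiro.test has not appeared earlier in this lab; its calling
pattern for the first column of the data would look like the sketch below (the
second column should be checked in the same way):

> qqnorm(diabetes[, 1])
> qqline(diabetes[, 1])
> shapiro.test(diabetes[, 1])   # Shapiro-Wilk test; H0: the data is normal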
Paired t-test: Sleep data example
We now consider a paired t-test applied to Student’s Sleep data, which shows
the effect of two soporific drugs (increase in hours of sleep compared to control)
on 10 patients. Because we see the effect of both drugs on each patient, it is
possible to analyse the data using a paired test.
The sleep data comes built into R, so there is nothing to download. Note that
the structure of the dataset is different from that of the diabetes example:
here the data is stored in "long" format, with one column (extra) recording
the response and another (group) indicating which drug was taken. The R
commands which we use will reflect this.
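A quick way to inspect this structure:

> str(sleep)   # 20 obs. of 3 variables: extra (response), group (drug), ID (patient)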
To perform a paired t-test in R, the optional argument paired = TRUE is
used. We compare this to the default case, where this is set to be false.
> t.test(extra ~ group, data = sleep, paired = TRUE)
Paired t-test
data: extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-2.4598858 -0.7001142
sample estimates:
mean of the differences
-1.58
> t.test(extra ~ group, data = sleep, var.equal = TRUE)
Two Sample t-test
data: extra by group
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.363874 0.203874
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33
In this case, failing to use a paired test would have resulted in a different
conclusion. The reason is that the effect of either drug varies considerably
from patient to patient, while within each patient Drug 2 is consistently more
effective than Drug 1; the unpaired test treats this between-patient
variability as noise, which swamps the consistent within-patient difference.
This is most clearly demonstrated by plotting the effects of the two drugs
against each other, where a strong linear trend is clearly visible.
> plot( sleep[1:10, 1], sleep[11:20, 1],
+       xlab = "Drug 1", ylab = "Drug 2",
+       main = "Sleep Dataset" )
Figure 5: The effect of Drug 2 plotted against the effect of Drug 1 for each
patient in the sleep dataset.
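Finally, note that a paired t-test is equivalent to a one-sample t-test
applied to the within-patient differences. A small sketch of this check (d is
just an illustrative name; as in the plot above, rows 1-10 of sleep correspond
to Drug 1 and rows 11-20 to Drug 2, in the same patient order):

> d <- sleep[11:20, 1] - sleep[1:10, 1]   # Drug 2 minus Drug 1, per patient
> mean(d)                                 # 1.58, matching the paired test output
> t.test(d)                               # reproduces t = 4.0621, p-value = 0.002833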