Hypothesis Tests

Arthur White

10th February, 2016

The aim of this lab is to learn how to implement some of the methods for hypothesis testing which we have covered so far this semester. The underlying theory behind these methods, as well as some simple data handling routines in R, will also be discussed. Try to make sure that you understand the purpose of each line of code.

Probability Distributions

Recall that in the lab for ST1251 last semester, we used the functions dhyper, dbinom, and dpois. These give the probability functions, in this case P(X = x), for the hypergeometric, binomial and Poisson distributions respectively. This and three other primary functions are available in R for these and other probability distributions. Each function uses a one-letter prefix followed by a name for the distribution. See below for a table explaining the meaning of the different prefixes, and below that, a table of some of the different probability distributions which are available. In this lab we will concern ourselves with the normal and Student's t distributions.

Prefix   Meaning       Interpretation
d        density       f(x) (= P(X = x), when X is discrete)
p        probability   P(X ≤ x)
q        quantile      k such that P(X ≤ k) = p
r        random        x ∼ P(·)

Suffix   Probability Distribution
binom    Binomial
pois     Poisson
hyper    Hypergeometric
norm     Normal
exp      Exponential
t        Student's t

The Normal Distribution

Firstly we will review some theory regarding the normal distribution. We use dnorm and dt to compare the behaviour of the standard normal distribution (i.e., where µ = 0 and σ = 1) to Student's t distribution. What happens as the degrees of freedom (df) of the Student's t distribution increase?

> x <- seq(-4, 4, length.out = 1000)
> y1 <- dnorm(x, mean = 0, sd = 1)
> y2 <- dt(x, df = 5)
> y3 <- dt(x, df = 10)
> y4 <- dt(x, df = 50)
> plot(x, y1, ylim = range(y1, y2, y3, y4), type = "l", ylab = "f(x)",
+      xlab = "x", main = "Comparison of Normal to Student t Distribution",
+      lwd = 3)
> lines(x, y2, col = "blue", lwd = 2)
> lines(x, y3, col = "red", lty = 2, lwd = 2)
> lines(x, y4, col = "cyan", lty = 2, lwd = 2)

[Figure 1: Comparison of the standard normal distribution to the Student t-distribution with progressively larger degrees of freedom.]

To calculate probabilities, we use the function pnorm. This calculates the area under the curve to the left of a value x, much like the reference tables used in class. To calculate the quantiles of X which correspond to particular percentiles, the function qnorm may be used.

For example, suppose that the length X of machinery parts produced in a factory is believed to be normally distributed with mean µ = 20mm and standard deviation σ = 2.3mm. Then we can estimate the proportion of parts that will measure less than 15mm in length using the following code:

> pnorm(15, mean = 20, sd = 2.3)
[1] 0.01485583

The proportion of parts measuring between 15mm and 25mm then follows by subtraction:

> pnorm(25, mean = 20, sd = 2.3) - pnorm(15, mean = 20, sd = 2.3)
[1] 0.9702883

To determine the measurements a and b such that the most central 95% of parts will lie between these lengths, we can use the quantile function:

> qnorm( c(0.025, 0.975), 20, 2.3 )
[1] 15.49208 24.50792

Exercise 1
Use pt to calculate the p-value of a two-sided test with ν = 15 degrees of freedom and t = −3.2. Use qt to obtain the corresponding critical values for the test, with α = 0.01.

Assessing Normality

Now suppose that the part of length X connects to a second part, of length Y, where Y follows a normal distribution with mean µ = 25mm and standard deviation σ = 3.7mm. Let S = X + Y. What is the distribution of S?
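Recall that a sum of independent normal random variables is itself normally distributed, with mean equal to the sum of the means and variance equal to the sum of the variances. The parameters of S can therefore be written down directly; a minimal sketch, assuming X and Y are independent:

> # If X ~ N(20, 2.3^2) and Y ~ N(25, 3.7^2) are independent, then
> # S = X + Y ~ N(20 + 25, 2.3^2 + 3.7^2).
> mu.s <- 20 + 25                # mean of S
> sd.s <- sqrt(2.3^2 + 3.7^2)    # standard deviation of S, approximately 4.36
> c(mu.s, sd.s)

These values can be compared with the simulated estimates obtained below.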
We can use rnorm to simulate values of X and Y, and use these to examine the relevant properties of S. For example, to estimate the mean and standard deviation of S, we could use the code below. As the number of simulations increases, the approximate answers should become more and more accurate.

> x <- rnorm(100, mean = 20, sd = 2.3)
> y <- rnorm(100, 25, 3.7)
> s <- x + y
> mean(s)
[1] 44.64785
> sd(s)
[1] 4.254638

To assess the normality of data, we can use quantile-quantile, or QQ, plots. The functions qqnorm and qqline can be used to perform this task. For example, to assess whether the simulated data s follow a normal distribution, use:

> qqnorm(s, main = "Q-Q Normal Plot of S")
> qqline(s)

This plot is shown in Figure 2. Does S appear to be normally distributed?

[Figure 2: QQ plot for the statistic S = X + Y.]

Exercise 2
Consider other transformations of X and Y, e.g., W = √(X/Y). Do these appear to be normally distributed?

Independent two sample t-test: Diabetes dataset

We will now implement a t-test in R. We analyse the systolic blood pressure readings of 21 diabetic and 21 non-diabetic males. The dataset consists of a 21 × 2 matrix, with the first column containing the blood pressure readings of the diabetic men, and the second column those of the non-diabetic men. Although the data is simulated, it is based on a real analysis. The dataset is available to download at https://www.scss.tcd.ie/~arwhite/Teaching/ST1252/diabetes_sim.csv.

To read the data into R, first download it to a folder or directory of your preference. (Saving it to your desktop folder is convenient, but only as a temporary solution!) The data can be read into RStudio by going to Tools → Import Dataset → From Text File, and then clicking on the location of the file. Alternatively, if you know the file location of the data, then it can be read in directly using the read.csv() command.

> diabetes <- read.csv(file = "~/Desktop/diabetes_sim.csv")

Visually, we can compare the differences between the blood pressure readings of the two groups using boxplots or stripcharts. These are displayed in Figures 3 and 4. Which group appears to have lower blood pressure? How pronounced do you think the difference is?

> boxplot(diabetes[, 1], diabetes[, 2], names = c("Diabetic", "Non-diabetic"))

[Figure 3: Boxplot of the diabetes dataset.]

> stripchart(diabetes, method = "jitter",
+     vertical = TRUE, group.names = c("Diabetic", "Non-diabetic"))

[Figure 4: Stripchart of the diabetes dataset.]

It is straightforward to explicitly calculate the test statistic for an independent two sample t-test in R, where we use a pooled variance:

> xbar <- mean(diabetes[, 1])
> ybar <- mean(diabetes[, 2])
> sx <- sd(diabetes[, 1])
> sy <- sd(diabetes[, 2])
> s.pool <- sqrt( (sx^2 + sy^2)/2 )   # both groups have the same size, n = 21
> n <- nrow(diabetes)
> test.stat <- ( xbar - ybar )/( sqrt( 2 * s.pool^2 / n ) )
> test.stat
[1] 2.270401

The same test can be carried out with the built-in function t.test:

> t.test(x = diabetes[, 1], y = diabetes[, 2], var.equal = TRUE)

        Two Sample t-test

data:  diabetes[, 1] and diabetes[, 2]
t = 2.2704, df = 40, p-value = 0.02864
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.299436 22.366364
sample estimates:
mean of x mean of y
 152.7791  140.9462

The test statistic in the function's output should correspond exactly to test.stat, as calculated above.
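Note that s.pool above averages the two sample variances directly; this is valid here only because both groups contain the same number of observations. In general, the pooled variance weights each sample variance by its degrees of freedom. A minimal sketch of the general calculation, reusing the quantities defined above (it reduces to the formula used earlier when the sample sizes are equal):

> # General pooled variance for sample sizes nx and ny:
> #   s_p^2 = ((nx - 1) * sx^2 + (ny - 1) * sy^2) / (nx + ny - 2)
> nx <- nrow(diabetes)
> ny <- nrow(diabetes)   # both 21 here, so this reduces to (sx^2 + sy^2)/2
> s.pool2 <- ( (nx - 1) * sx^2 + (ny - 1) * sy^2 ) / (nx + ny - 2)
> ( xbar - ybar ) / sqrt( s.pool2 * (1/nx + 1/ny) )   # matches test.stat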
Exercise 3
Identify the test statistic, degrees of freedom, and p-value from the output of t.test() for the pooled variance test. Use pt to check whether the p-value in this output corresponds to what you would obtain based on your test statistic. What is the 95% confidence interval for the difference between diabetic and non-diabetic men? Interpret this with respect to the result of the test.

Model validation

We have specified the optional argument var.equal = TRUE, i.e., we have assumed equal variances. We can test this assumption directly, as well as perform a t-test on the data with the equal variances assumption relaxed. (The latter is in fact the default configuration of the test.)

> var.test(diabetes[, 1], diabetes[, 2])

        F test to compare two variances

data:  diabetes[, 1] and diabetes[, 2]
F = 1.3934, num df = 20, denom df = 20, p-value = 0.4648
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.5653781 3.4339273
sample estimates:
ratio of variances
          1.393365

> t.test(diabetes[, 1], diabetes[, 2], var.equal = FALSE)

        Welch Two Sample t-test

data:  diabetes[, 1] and diabetes[, 2]
t = 2.2704, df = 38.948, p-value = 0.02879
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
  1.290565 22.375234
sample estimates:
mean of x mean of y
 152.7791  140.9462

Here we have little evidence to suggest that our equal variance assumption is unreasonable. Note that in this case the output from the latter t-test (formally, Welch's test) is much the same as for the pooled variance test.

Exercise 4
Check the normality assumptions underlying the test, visually, using qqnorm and qqline, and formally, with shapiro.test.

Paired t-test: Sleep data example

We now consider a paired t-test applied to Student's sleep data, which records the effect of two soporific drugs (increase in hours of sleep compared to control) on 10 patients. Because we observe the effect of both drugs on each patient, it is possible to analyse the data using a paired test. The dataset sleep is loaded in R by default. Note that the structure of this dataset is different from that of the diabetes example; the R commands which we use will reflect this. To perform a paired t-test in R, the optional argument paired = TRUE is used. We compare this to the default case, where it is set to FALSE.

> t.test(extra ~ group, data = sleep, paired = TRUE)

        Paired t-test

data:  extra by group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -2.4598858 -0.7001142
sample estimates:
mean of the differences
                  -1.58

> t.test(extra ~ group, data = sleep, var.equal = TRUE)

        Two Sample t-test

data:  extra by group
t = -1.8608, df = 18, p-value = 0.07919
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.363874  0.203874
sample estimates:
mean in group 1 mean in group 2
           0.75            2.33

In this case, failing to use a paired test would have resulted in a different conclusion. The reason is that the effect of either drug varies more between patients than the difference between the two drugs varies within each patient, even though Drug 2 is consistently more effective. Thus the effect is highly dependent on which patient has taken the drug.
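Equivalently, the paired test is simply a one-sample t-test applied to the ten within-patient differences. A minimal sketch, assuming the default row ordering of sleep (group 1 in rows 1-10, group 2 in rows 11-20):

> # Within-patient differences: Drug 1 minus Drug 2 for each patient.
> d <- sleep[1:10, 1] - sleep[11:20, 1]
> mean(d) / ( sd(d) / sqrt(length(d)) )   # reproduces t = -4.0621 from above
> t.test(d)                               # equivalent to the paired test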
This is most clearly demonstrated by plotting the effects of the drugs against each other. A strong linear trend is clearly visible.

> plot( sleep[1:10, 1], sleep[11:20, 1],
+     xlab = "Drug 1", ylab = "Drug 2",
+     main = "Sleep Dataset" )

[Plot: "Sleep Dataset", effect of Drug 2 against effect of Drug 1.]
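The strength of this linear trend can be summarised numerically by the sample correlation between each patient's two responses; a quick sketch along the same lines:

> # Correlation between responses to the two drugs; a value close to 1
> # reflects the strong linear trend visible in the plot.
> cor( sleep[1:10, 1], sleep[11:20, 1] )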