Worksheet 3 One and two-sample Tests

3.1 One sample t test

The t test is based on the assumption that the data come from a Normal (aka Gaussian) distribution. It is also motivated by the CENTRAL LIMIT THEOREM (see lecture 3).

As in Worksheet 1 we use some pretend data, relating to the energy intake of mice. To begin, load in the data from the Excel spreadsheet Intake, read here from the text file Intake.txt. Assign the data to the data frame called daily.intake

> daily.intake <- read.table("Intake.txt", header=TRUE)

Note that read.table() returns a data frame. If the commands below complain that the argument is not numeric, and the file holds a single column, extract that column as a vector first, for example with daily.intake <- daily.intake[[1]]

We can look at some summary statistics

> mean(daily.intake)
> sd(daily.intake)
> summary(daily.intake)

Suppose we wish to test whether the mice's mean intake was significantly different from the value of 7725. We could use a t-test

> t.test(daily.intake, mu=7725)

You get the following output

        One Sample t-test

    data:  daily.intake
    t = -2.8208, df = 10, p-value = 0.01814
    alternative hypothesis: true mean is not equal to 7725
    95 percent confidence interval:
     5986.348 7520.925
    sample estimates:
    mean of x
     6753.636

The interpretation of the above is as follows

        One Sample t-test
    data:  daily.intake

This tells us the test being performed and the data used

    t = -2.8208, df = 10, p-value = 0.01814

This tells us the value of the t-statistic (-2.8208), which is the figure before it is converted into a probability; the degrees of freedom, df = 10, as we have 11 data points (df = n - 1); and finally a p-value, which is the probability, if the null hypothesis were true, of seeing a t-statistic at least as extreme as the one observed.

The t-statistic for "t.test(daily.intake, mu=7725)" is calculated by the following

> (mean(daily.intake)-7725)/(sd(daily.intake)/sqrt(11))

See lecture 3.

Now, back to the output.

    alternative hypothesis: true mean is not equal to 7725

This states that we are performing a two-sided test.
That is, we are interested in testing for a difference in the mean, in either direction, AT LEAST AS BIG AS abs(mean(daily.intake)-7725)

    95 percent confidence interval:
     5986.348 7520.925
    sample estimates:
    mean of x
     6753.636

The 95 percent confidence interval gives the set of values for the population mean, mu, which have a p-value GREATER than 0.05 when tested as the null value. That is, these are the values of mu that the data would not reject at the 5% level. (If the experiment were repeated many times, intervals built this way would contain the true population mean in 95% of the repeats.) Alternatively, any null value outside of this region has a p-value less than 0.05.

To see this, we can take the upper limit of the confidence interval, in our case (above) 7520.925, and test it

> t.test(daily.intake, mu=7520.925)

We see that, as we expect, the p-value is 0.05 (up to rounding), and for a value just inside the confidence region

> t.test(daily.intake, mu=7520)

we see a p-value greater than 0.05.

Now suppose we were interested in testing whether the actual unknown (and never known) population mean, mu, was less than 7725. This is an example of a one-sided test

> t.test(daily.intake, mu=7725, alternative="less")

        One Sample t-test

    data:  daily.intake
    t = -2.8208, df = 10, p-value = 0.009069
    alternative hypothesis: true mean is less than 7725
    95 percent confidence interval:
         -Inf 7377.781
    sample estimates:
    mean of x
     6753.636

Note that the "alternative hypothesis" has changed, as has the p-value, which is now half the two-sided value because the sample mean lies on the side of the alternative. Also, the confidence interval now extends from -Infinity on the left to 7377.781 on the right.

What do you think happens when you perform a two-sided t-test with the null value, mu, set to the sample mean? THINK ABOUT WHAT ANSWER YOU EXPECT BEFORE YOU DO IT!!

> t.test(daily.intake, mu=mean(daily.intake))

What is the p-value? Is this what you expected?

Note: the confidence interval is unaltered by the change in the tested mu value.

Now do a one-sided test, say,

> t.test(daily.intake, mu=mean(daily.intake), alternative="greater")

What is the p-value? Is this what you expected?
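The pieces of the t.test() output discussed above can be rebuilt by hand. The following is a minimal sketch, assuming a stand-in vector x in place of the contents of Intake.txt. The values are an assumption, chosen because their mean is 6753.636 as in the output above, and the names x, mu0, se, t_stat, p_two, p_less and ci are illustrative only.

```r
# Stand-in data (an assumption): 11 values whose mean matches the
# "mean of x" of 6753.636 shown in the worksheet output.
x <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
mu0 <- 7725                       # null value being tested

n  <- length(x)
df <- n - 1                       # degrees of freedom
se <- sd(x) / sqrt(n)             # standard error of the mean
t_stat <- (mean(x) - mu0) / se    # same formula as in the worksheet

# Two-sided p-value: probability of |T| >= |t_stat| under the null
p_two <- 2 * pt(-abs(t_stat), df)

# One-sided p-value for alternative "less": probability of T <= t_stat
p_less <- pt(t_stat, df)

# 95% confidence interval: mean +/- t quantile * standard error
ci <- mean(x) + c(-1, 1) * qt(0.975, df) * se

# Compare with the built-in test
res <- t.test(x, mu = mu0)
print(c(t = t_stat, p.two.sided = p_two, p.less = p_less))
print(ci)
print(res)
```

If the stand-in values do match the file, the hand-computed figures should agree with the output quoted above. Either way, they agree with what t.test() reports for the same vector.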
As an alternative to the t-test you can use a nonparametric Wilcoxon signed-rank test

> ?wilcox.test

This ranks the differences (x_i - mu) by size and compares the ranks of the positive and negative values. If mu was the sample mean then these would tend to cancel.

3.2 Two sample t test

The two-sample t-test is used when we have collected samples from two populations (conditions) and we wish to test for a difference in means. Let's load some data

> library(ISwR)
> data(energy)
> attach(energy)
> energy

> t.test(expend~stature, var.equal=TRUE)

        Two Sample t-test

    data:  expend by stature
    t = -3.9456, df = 20, p-value = 0.000799
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.411451 -1.051796
    sample estimates:
     mean in group lean mean in group obese
               8.066154           10.297778

which has the same structure as the one-sample test.

Note that the statement "expend~stature" in t.test() states that we wish to test for differences in the variable "expend" based on the value of the variable "stature" (lean/obese).

To allow for unequal variances we use the Welch approximation

> t.test(expend~stature, var.equal=FALSE)

        Welch Two Sample t-test

    data:  expend by stature
    t = -3.8555, df = 15.919, p-value = 0.001411
    alternative hypothesis: true difference in means is not equal to 0
    95 percent confidence interval:
     -3.459167 -1.004081
    sample estimates:
     mean in group lean mean in group obese
               8.066154           10.297778

3.3 Two sample Wilcoxon test

For a distribution-free (nonparametric) test use the two-sample Wilcoxon test, which is based on comparing the ranks of the data. This test does not rely on a particular distribution for the underlying data.
That is, unlike the t-test, it does not assume that the underlying population distribution is Normal

> wilcox.test(expend~stature)

        Wilcoxon rank sum test with continuity correction

    data:  expend by stature
    W = 12, p-value = 0.002122
    alternative hypothesis: true mu is not equal to 0

    Warning message:
    Cannot compute exact p-value with ties in: wilcox.test.default(x

Note that when there are ties (same values) in the data set, the exact p-value cannot be computed and the Wilcoxon test is an approximate test (it uses a Normal approximation).
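To see how the ranks enter the test, the W statistic can be rebuilt by hand. The following is a minimal sketch on small made-up vectors (hypothetical values, not the energy data, and chosen without ties so that no warning appears). The names lean, obese, r, n1 and W_hand are illustrative only. In R, W is the sum of the ranks of the first group in the combined sample, minus n1*(n1+1)/2.

```r
# Hypothetical illustrative data (NOT the ISwR energy data set)
lean  <- c(7.5, 8.1, 6.9, 9.2, 8.4)
obese <- c(9.9, 10.3, 8.8, 11.1)

# Rank all observations together; tied values would share an average rank
r  <- rank(c(lean, obese))
n1 <- length(lean)

# Sum of the first group's ranks, shifted so that the minimum possible W is 0
W_hand <- sum(r[seq_len(n1)]) - n1 * (n1 + 1) / 2

# Equivalent view: the number of (lean, obese) pairs in which lean > obese
W_pairs <- sum(outer(lean, obese, ">"))

res <- wilcox.test(lean, obese)
print(c(by.hand = W_hand, by.pairs = W_pairs, wilcox.test = unname(res$statistic)))
```

Because only the ranks are used, replacing the largest value with something far bigger leaves W unchanged, which is why the test is not sensitive to the shape of the distribution in the way the t-test is.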