Applied statistics for testing and evaluation MED4
STAT05 – Inferential Statistics
Lecturer: Smilen Dimitrov

Introduction
• We previously discussed, in descriptive statistics, measures of
– central tendency (location) of a data sample (collection) – arithmetic mean, median and mode; and
– statistical dispersion (variability) – range, variance and standard deviation
• Today we look at the concept of distributions in more detail, and introduce quantiles as the final topic in descriptive statistics
• We will also look at how we perform these operations in R, and a bit more about plotting

Review of frequency distributions
• Relative frequency distribution: rf <- table(Data.Sample)/length(Data.Sample); y <- rf["x"]
– The value of y is the relative frequency of occurrence of x – the percentage of cases/times x has occurred in the sample
• Cumulative frequency distribution: x <- quantile(Data.Sample, y)
– Here y is the relative frequency of occurrence of values less than x
(Figure: both distributions plotted with the x values on the x axis and the frequencies on the y axis)

Review of PDF and CDF
• PDF – probability density function: y = p(x) = pdf(x) – the probability of getting exactly x
• CDF – cumulative distribution function: y = f(x) = cdf(x) – the probability of getting less than x (the area under the pdf curve, from –infinity to x)
• Quantiles – the x values obtained by dividing the y range of the CDF (from 0 to 1) into q equal parts; e.g. the second quartile is the median

Review of PDF and CDF – Uniform distribution in R
• runif gives random samples, uniformly distributed. Parameters – the range (min and max values), which must be added to specify a range such as (-3, 3)
• PDF: y = pdf(x) = dunif(x) – the probability of getting exactly x; dunif(0.7, -3, 3) gives y = 0.166..
• CDF: y = cdf(x) = punif(x) – the probability of getting less than x (area under the pdf curve); punif(0.7, -3, 3) gives y = 0.616..
• Inverse CDF: x = cdf⁻¹(y) = qunif(y); qunif(0.616, -3, 3) gives x ≈ 0.7

Review of PDF and CDF – Normal distribution in R
• rnorm gives random samples, normally distributed. Parameters – mean and sd (these translate and scale the curve). Default – the standard normal distribution: mean 0, sd = 1
• PDF: y = pdf(x) = dnorm(x) – the probability of getting exactly x; dnorm(0.7) gives y = 0.312..
• CDF: y = cdf(x) = pnorm(x) – the probability of getting less than x (area under the pdf curve); pnorm(0.7) gives y = 0.758..
• Inverse CDF: x = cdf⁻¹(y) = qnorm(y); qnorm(0.758) gives x ≈ 0.7

Review of PDF and CDF – T distribution in R
• rt gives random samples, t-distributed. Parameter – the degrees of freedom, which must be added, e.g. rt(n, 5) (it can only scale the curve)
• PDF: y = pdf(x) = dt(x, df) – the probability of getting exactly x; dt(0.7, 5) gives y = 0.286..
• CDF: y = cdf(x) = pt(x, df) – the probability of getting less than x (area under the pdf curve); pt(0.7, 5) gives y = 0.742..
• Inverse CDF: x = cdf⁻¹(y) = qt(y, df); qt(0.742, 5) gives x ≈ 0.7

Samples and population – sampling
• Descriptive statistics summarize the characteristics of a sample of data
• Inferential statistics attempt to say something about a population on the basis of a sample of data – infer to all on the basis of some
• 'How many penguins are there on a particular ice floe in the Antarctic?'
• Penguins tend to move around and swim off, and it's cold! So scientists use aerial photographs and statistical sampling to estimate population size.
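The d/p/q function families above can be checked directly at the R console; a minimal sketch reproducing the values quoted on the slides (range (-3, 3) for the uniform, 5 degrees of freedom for the t):

```r
# Uniform distribution on (-3, 3): the density is 1/6 everywhere in the range
dunif(0.7, min = -3, max = 3)   # 0.1666...
punif(0.7, min = -3, max = 3)   # 0.6166... = P(X < 0.7)
qunif(0.6166667, min = -3, max = 3)   # 0.7 - the inverse CDF recovers x

# Standard normal distribution (mean 0, sd 1 are the defaults)
dnorm(0.7)          # 0.312...
pnorm(0.7)          # 0.758...
qnorm(pnorm(0.7))   # 0.7

# Student's t distribution with 5 degrees of freedom
dt(0.7, df = 5)              # 0.286...
pt(0.7, df = 5)              # 0.742...
qt(pt(0.7, df = 5), df = 5)  # 0.7
```

The same d/p/q/r naming pattern (density, probability, quantile, random) applies to every distribution R ships with.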
Samples and population – sampling
• Imagine a large, snow-covered, square region of the Antarctic that is inhabited by penguins. From above, it would look like a white square sprinkled with black dots.
• If you had such access, you could count the dots to determine the number of penguins in this region.
• The region is too large for one photo – instead, take 100 photographs of the 100 smaller square sub-regions and count them all – but that would take too long and be too expensive!
• (The total count for the population in the image is 500)

• Another alternative – select a representative sample of the sub-regions, obtain photos of only these, and use the counts from these sub-regions to estimate the total number of penguins

• Suppose you had access to three samples; use the results from each to obtain an estimate
• Notice the effect of sample size on the estimate!

• There is a balancing act in selecting the sample size.
• A larger sample may cost more money or be more difficult to generate, but it should provide a more accurate estimate of the population characteristic
• For this sample of 10 photographs, the estimate is 450
• Estimates of the total penguin population vary quite a bit based on both the sample size and which sub-regions were sampled. The decision about how to select a sample, accordingly, is a critical one in statistics.

Samples and population – sampling
• There are different ways to randomly select a sample
• Think of a way to pick 10 numbers between 00 and 99 at random.
• One possible method for solving this problem is to use two 10-sided dice, one red and one blue.
• The sub-region can then be determined by the two dice (in the order red, then blue).
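The dice scheme and the scale-up rule (sample mean of the counts times the number of sub-regions) can be sketched in R; the per-region penguin counts below are hypothetical stand-ins, only the estimation rule is from the slides:

```r
set.seed(1)  # for reproducibility of the random tosses

# Two 10-sided dice with faces 0-9: red gives the tens digit, blue the units,
# so each toss selects one of the 100 sub-regions, numbered 00-99
red    <- sample(0:9, 10, replace = TRUE)
blue   <- sample(0:9, 10, replace = TRUE)
chosen <- 10 * red + blue          # may contain duplicates

# Hypothetical penguin counts for all 100 sub-regions
counts <- rpois(100, lambda = 5)

# Estimate: mean count over the sampled sub-regions, scaled up to 100 regions
estimate <- mean(counts[chosen + 1]) * 100
```

Running this repeatedly with different seeds shows the sampling variation the slides describe: each sample gives a somewhat different estimate of the same fixed population total.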
• This random selection process will sometimes produce duplicates.
• For instance, you might find that seven tosses of the dice produced these sub-region choices: 19 22 39 50 34 05 39
• If we do not want duplicates, we can skip them until we get 10 distinct numbers, for example: 19 22 39 50 34 05 75 62 87 13
• This is called sampling without replacement
• The estimate of the total number of penguins for the entire region based on this random sample is based on the sample mean: = 450

Effect of sample size on calculated parameters
• Answers for an estimate of a population will vary depending on which particular elements were taken in a sample
• To see the effects, we can perform sampling with replacement on our raisin sample in R – bootstrapping – and plot the results
(Figure: sample mean and one-s.d. range against the population mean and one-s.d. range; the darker a dot is, the more times the value occurs in a given sample)
• Conclusion – the greater the sample size, the more closely the mean and the variance of the sample approach those of the population

• We can repeat this exercise many more times, each time taking a random sample from a normally distributed variable, and showing only the variance
• As sample size declines, the range of estimates of the sample variance increases dramatically (remember that the population variance is constant at s² = 4 throughout)
• The problem becomes severe below samples of 13 or so, and is very serious for samples of seven or fewer.
• For small samples, the estimated variance is badly behaved, and this has serious consequences for estimation and hypothesis testing.

How many samples?
• For general statistics:
– Take n = 30 samples if you can afford it and you won't go far wrong.
– Anything less than this is a small sample, and anything less than 10 is a very small sample.
• Usually, our data form only a small part of all the possible data we could collect. Not all possible users participate in a usability test, and not every possible respondent answered our questions.
• The mean we observe is therefore unlikely to be the exact mean for the whole population – the scores of our users in a test are not going to be an exact index of how all users would perform.
– How can we relate our sample to everyone else?

Confidence intervals
• If we repeatedly sample and calculate means from a population (with any distribution), our list of means will itself be normally distributed (central limit theorem)
(Figure: population plot with 1 s.d. range; plot of means from samples, with the true mean, a sample mean and the CI marked)
• This implies that our observed mean follows the same rules as all data under the normal curve
• The distribution of means is normal around the "true" population mean – so our observed mean is 68% likely to fall within 1 SD of the true population mean; but we don't know the "true" population mean
• We only have the sample, not the population – so we use an estimate of this SD of means, known as the Standard Error of the Mean
• And we can say that we are 68% confident that the true mean = sample mean ± standard error of the sample mean -> confidence interval
• A confidence interval (CI) for a population parameter is an interval between two numbers, with an associated probability p, which is generated from a random sample of an underlying population.

Confidence intervals – Example
• We test 20 users on a new interface: mean error score: 10, sd: 4
• What can we infer about the broader user population?
• According to the central limit theorem, our observed mean (10 errors) is itself 95% likely to be within 2 standard errors of the true (but unknown to us) mean of the population
• If the standard error of the mean = 4/√20 ≈ 0.89, then the observed (sample) mean lies within a normal distribution about the 'true' or population mean, so we can be:
– 68% confident that the true mean = 10 ± 0.89 (sample mean ± 1 s.e.)
– 95% confident our population mean = 10 ± 1.78 (sample mean ± 2 s.e.)
– 99% confident it is within 10 ± 2.67 (sample mean ± 3 s.e.)

Confidence intervals – Example B (assumed normal distribution)
• A machine fills cups with margarine, and is supposed to be adjusted so that the mean content of the cups is close to 250 grams of margarine. (The true population mean µ should be 250, but we don't know yet whether it is.)
• To check whether the machine is adequately adjusted, a sample of n = 25 cups of margarine is chosen at random and the cups weighed.
• The general sample mean, the estimator of the expectation (population mean) µ, is X̄ = (1/n) Σᵢ Xᵢ
• For this sample, with actual weights x1, …, x25: x̄ = (1/25) Σᵢ xᵢ = 250.2 grams
• The sample standard deviation is s(n−1) = √( (1/(n−1)) Σᵢ (xᵢ − x̄)² )
• There is a whole interval around the observed value 250.2 of the sample mean such that, if the whole population mean actually takes a value in this range, the observed data would not be considered particularly unusual. Such an interval is called a confidence interval for the parameter µ.

Confidence intervals – Example B
• We may determine the endpoints by considering that the sample mean X̄ from a normally distributed sample is also normally distributed, with the same expectation µ, but with standard deviation σ/√n = 0.5 grams – note this is the standard error of the mean!
• So we make the standardization replacement: Z = (X̄ − µ)/(σ/√n) = (X̄ − µ)/0.5
• Z now has a standard normal distribution (mean 0, sd 1), independent of the parameter µ to be estimated.
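The interval this standardization leads to can be computed directly in R; a minimal sketch using only the slide's numbers (x̄ = 250.2, σ/√n = 0.5):

```r
xbar <- 250.2   # observed sample mean (grams)
se   <- 0.5     # standard deviation of the sample mean, sigma / sqrt(n)

# z such that P(-z < Z < z) = 0.95 for a standard normal Z
z <- qnorm(0.975)          # 1.9599...

ci <- xbar + c(-1, 1) * z * se
ci                         # about 249.22 and 251.18
```

The 0.975 (rather than 0.95) quantile is used because the 5% of probability outside the interval is split evenly between the two tails.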
• Hence it is possible to find numbers −z and z, independent of µ, between which Z lies with probability 1 − α, a measure of how confident we want to be. We take 1 − α = 0.95.

Confidence intervals – Example B
• We take 1 − α = 0.95, so we have P(−z ≤ Z ≤ z) = 1 − α = 0.95
• z can be calculated from the CDF. Remember the CDF Φ(z) gives the probability of Z ≤ z only! So
Φ(z) = P(Z ≤ z) = 1 − α/2 = 0.975
z = Φ⁻¹(0.975) = 1.96
• So 0.95 = P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = P(X̄ − 0.98 ≤ µ ≤ X̄ + 0.98)
• So the 95% confidence interval is between X̄ − 0.98 and X̄ + 0.98

Confidence intervals – Example B
• So the 95% confidence interval is between X̄ − 0.98 and X̄ + 0.98
• In other words, with probability 0.95 one will find the parameter µ between these stochastic endpoints (in this example: 249.22 and 251.18)
• Every time the measurements are repeated, there will be another value for the mean X̄ of the sample.
– In 95% of the cases µ will be between the endpoints calculated from this mean, but in 5% of the cases it will not be.
• We cannot say: 'with probability (1 − α) the parameter µ lies in the confidence interval.'
– We only know that by repetition, in 100(1 − α)% of the cases µ will be in the calculated interval.

Confidence intervals – Bootstrap
• We can also use the bootstrap to find a confidence interval – by sampling with replacement, many times, from a sample, finding the means, and looking for a confidence interval based on these.
• In R we use the quantile function for this, which will generate the cumulative frequency distribution based on a sample.

Confidence intervals – Student's t
• For small sample sizes (n < 30), instead of the CDF for the normal distribution (qnorm) we can use Student's t distribution (qt)

Student's t distribution
• Student's t-distribution is a probability distribution that arises in the problem of estimating the mean of a normally distributed population when the sample size is small.
• Its probability density function is
f(t) = Γ((ν+1)/2) / (√(νπ) · Γ(ν/2)) · (1 + t²/ν)^(−(ν+1)/2)
• The distribution depends on ν (d.o.f.), but not on µ (mean) or σ (s.d.); the lack of dependence on µ and σ is what makes the t-distribution important
• The overall shape of the pdf of the t-distribution resembles the bell shape of a normally distributed variable with mean 0 and variance 1, except that it is a bit lower and wider. As the number of d.o.f. grows, the t-distribution approaches the normal distribution with mean 0 and variance 1.

Student's t distribution
(Figure: red – Student's t distribution, 5 d.o.f.; black – standard normal distribution (mean 0, sd 1))

Single sample inference and tests
• Suppose we have a single sample. The questions we might want to answer are these:
– What is the mean value?
– Is the mean value significantly different from current expectation or theory?
– What is the level of uncertainty associated with our estimate of the mean value?
• We use statistical tests to infer significant difference in the single-sample case.

Single sample inference and tests
Procedure for testing a statistical hypothesis:
1. State the null hypothesis – the current knowledge (or lack of knowledge) before the experiment takes place.
2. State the alternative hypothesis – the research hypothesis that we want to prove; our claim.
3. Choose a test statistic T. It must be suitable to differentiate between the null and alternative hypotheses. Calculate the value of T from the data.
4. Choose a significance level for the test: α = probability of observing a value of the statistic which falls in the critical region. It may be given; the most popular value of α is 5%.
5. Calculate the rejection region, acceptance region and critical value.
6. Decision: if T falls into the rejection region, we reject H₀. If T does not fall into the rejection region, we do not reject H₀.
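The six steps above collapse into a single call in R; a minimal sketch with hypothetical data (the sample, the hypothesized mean µ₀ = 5 and α = 0.05 are all illustrative choices, not from the slides):

```r
set.seed(42)
x <- rnorm(12, mean = 5.4, sd = 1.2)   # hypothetical sample of n = 12 measurements

# H0: mu = 5, Ha: mu != 5 (two-sided), alpha = 0.05
res <- t.test(x, mu = 5)

res$statistic   # the test statistic T
res$p.value     # reject H0 when the p-value < alpha

# The same statistic computed by hand from the t formula, (xbar - mu0)/(s/sqrt(n)):
t_by_hand <- (mean(x) - 5) / (sd(x) / sqrt(length(x)))
```

`t.test` reports the statistic, the degrees of freedom (n − 1 = 11 here) and the p-value in one step, so steps 3 to 6 reduce to comparing `res$p.value` against the chosen α.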
Indeed, the wording is always the same for all kinds of tests, and you should try to memorize it. The abbreviated form is easier to remember: larger – reject; smaller – accept.

Single sample inference – Student's t test
The t statistic has a t distribution with n − 1 degrees of freedom. The formal procedure is as follows:
1. Null hypothesis: H₀: µ = µ₀
2. Alternative hypothesis:
a) Ha: µ > µ₀ (one-sided)
b) Ha: µ < µ₀ (one-sided)
c) Ha: µ ≠ µ₀ (two-sided)
3. Test statistic: t = (x̄ − µ₀) / (s/√n)
4. Decide on the value of α.
5. Calculate the p-value:
a) p-value = P(t > observed t)
b) p-value = P(t < observed t)
c) p-value = 2·P(t > |observed t|)
The probabilities of t are based on a t distribution with n − 1 d.f.
6. Reject H₀ when the p-value < α. Do not reject H₀ when the p-value ≥ α.

Single sample inference – Student's t test
Example: A casino makes the assumption that the average number of bets a customer plays is at least 7. A floor manager suspects that the number may be less than 7, and in order to confirm his suspicions he takes a sample of n = 6 customers and calculates the mean number of plays and the sample variance, obtaining x̄ = 6.15, s² = 3.955. Perform a hypothesis test to check the manager's suspicions.
1. H₀: µ = 7
2. Ha: µ < 7
3. x̄ = 6.15, s² = 3.955, t = (6.15 − 7)/√(3.955/6) = −1.047
4. α = 0.1
5. df = 6 − 1 = 5, p-value = P(t < −1.047) = P(t > 1.047) = 0.17 (approx. from tables) – in R: 1 - pt(1.047, 5)
6. p-value = 0.17 > α = 0.1 => we do not reject the casino's assumption of an average of 7 or more bets per customer.

"Tails"
Predicting the direction of the difference:
• Since we stated that we wanted to see if [something] was BETTER (> 70%), not just DIFFERENT (< or > 70%), this is asking for a one-sided test.
• For a one-tail (directional) test, the tester narrows the odds by half by testing for a specific difference.
• One-sided predictions specify which part of the normal curve the difference observed must reside in (left or right)
• For a two-sided test, one just wants to see if there is ANY difference (better or worse) between A and B.
• (Here the manager wants to see whether a customer makes fewer than 7 bets – not whether he makes any number of bets different from 7 – so a one-tailed test.)

Two sample inference and tests
• The so-called classical tests deal with some of the most frequently used kinds of analysis, and they are the models of choice for:
– comparing two variances (Fisher's F test, var.test),
– comparing two sample means with normal errors (Student's t-test, t.test),
– comparing two sample means with non-normal errors (Wilcoxon's rank test, wilcox.test),
– comparing two proportions (the binomial test, prop.test),
– testing for independence in contingency tables (chi-square test, chisq.test, or Fisher's exact test, fisher.test)
• First, we must realize the following: is it right to say samples with the same mean are identical? No! When the variances are different, don't compare the means. If you do, you run the risk of coming to entirely the wrong conclusion.

Two sample inference – Fisher F test (two variances)
• Before we can carry out a test to compare two sample means, we need to test whether the sample variances are significantly different – the Fisher F test
• To compare two variances, all you do is divide the larger variance by the smaller variance (this is the F ratio)
• For the variances to be significantly different, the F ratio will need to be significantly bigger than 1. How will we know a significant value of the variance ratio from a non-significant one?
– The answer, as always, is to look up a critical value – this time from the F distribution.
• The F distribution needs d.o.f. (sample size − 1) in the numerator and denominator of the F ratio.
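The variance-ratio comparison can be sketched in R; the two samples below are hypothetical stand-ins for two groups of n = 10, only the critical-value lookup (9 and 9 d.o.f.) is fixed by the design:

```r
set.seed(7)
a <- rnorm(10, mean = 20, sd = 1)   # hypothetical sample A, n = 10
b <- rnorm(10, mean = 20, sd = 3)   # hypothetical sample B, n = 10

# F ratio: larger variance over smaller variance
f_ratio <- max(var(a), var(b)) / min(var(a), var(b))

# Critical value of F at alpha = 0.05 (two-tailed, so 0.025 in each tail)
# with 9 d.o.f. in both numerator and denominator
qf(0.975, df1 = 9, df2 = 9)   # about 4.03

# Or let R run the whole test:
var.test(a, b)
```

`var.test` reports the F statistic, the degrees of freedom and a p-value directly, so the manual comparison against the critical value is only needed when working from tables.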
Two sample inference – Fisher F test (two variances)
• Example: two gardens, each with n = 10 entries (9 d.o.f. each), the same mean, but two different variances
• Null hypothesis – the two variances are not significantly different
• Set α = 0.05 significance level
• Find the critical value of F for this α = 0.05, with 9 d.o.f. for both numerator and denominator (through quantiles of the F distribution) – e.g. 4
• Here the F ratio of the variances is 10, which is > the critical value of F (4); therefore reject the null hypothesis – accept the alternative hypothesis that the two variances are significantly different.
• Because the variances are significantly different, it would be wrong to compare the two sample means using Student's t-test.
• The F test is the simplest analysis of variance (ANOVA) problem

Two sample inference
• How likely is it that our two sample means were drawn from populations with the same average?
– If the answer is "highly likely", then we shall say that our two sample means are not significantly different.
• There are two simple tests for comparing two sample means:
– Student's t-test – when the samples are independent, the variances constant, and the errors normally distributed, or
– Wilcoxon rank-sum test – when the samples are independent but the errors are not normally distributed (e.g. they are ranks or scores of some sort).

Two sample inference – Student's t test (two means)
• When we are looking at the differences between scores for two groups, we have to judge the difference between their means relative to the spread or variability of their scores – the t test does this.
• The test statistic (t-value) is the number of standard errors by which the two sample means are separated:
t = (difference between the two means) / (standard error of the difference) = (x̄_A − x̄_B) / SE_diff
• We know the standard error of the mean, but we have not yet met the standard error of the difference between two means. For two independent (i.e. non-correlated) variables, the variance of a difference is the sum of the separate variances:
SE_diff = √( s²_A/n_A + s²_B/n_B )

Two sample inference – Student's t test (two means)
• Example: two gardens with n = 10 entries (9 d.o.f. each) and different means
• The null hypothesis is that the two sample means are the same, and we shall accept this unless the value of Student's t is so large that it is unlikely that such a difference could have arisen by chance alone.
• Set α = 0.05 significance level – the chance of rejecting the null hypothesis when it is true (this is the Type I error rate)
• Find the critical value of t for this α = 0.05 and 18 d.o.f. for both gardens combined (through quantiles of the t distribution) – e.g. 2.1
• Here the value of Student's t is −3.872.
• Take the absolute value: 3.872 is > the critical value of t (2.10); therefore reject the null hypothesis – accept the alternative hypothesis that the two means are significantly different.
• This t-test is equivalent to a one-way ANOVA problem – note that ANOVA can deal with three or more means at once.

Review
• Descriptive statistics
– Measures of central tendency (location): arithmetic mean, median, mode
– Measures of statistical variability (dispersion – spread): range, variance, standard deviation, quantiles, interquartile range
– Probability distributions – uniform, normal (Gaussian) and t
• Inferential statistics
– Standard error – inference about unreliability
– Confidence interval
– Single sample t-test
– Two sample F-test and t-test

Exercise for mini-module 5 – STAT03
None
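The two-sample comparison covered above can be reproduced end to end in R; the garden data below are hypothetical, only the critical-value lookup (qt with 18 d.o.f.) matches the worked example:

```r
set.seed(3)
gardenA <- rnorm(10, mean = 3, sd = 1)   # hypothetical garden A, n = 10
gardenB <- rnorm(10, mean = 5, sd = 1)   # hypothetical garden B, n = 10

# Standard error of the difference between the two means
se_diff <- sqrt(var(gardenA)/10 + var(gardenB)/10)
t_stat  <- (mean(gardenA) - mean(gardenB)) / se_diff

# Critical value of t at alpha = 0.05 (two-sided) with 18 d.o.f.
qt(0.975, df = 18)   # about 2.10

# Reject H0 when |t_stat| exceeds the critical value; t.test does the same work
# (var.equal = TRUE gives the classical 18-d.o.f. test used in the example)
t.test(gardenA, gardenB, var.equal = TRUE)
```

With equal sample sizes, the pooled-variance statistic computed by `t.test(..., var.equal = TRUE)` coincides with the hand-computed `t_stat` above.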