Inference for 1 Sample
Purposes of These Notes
• Describe point estimation, interval estimation, and hypothesis testing.
• Describe a random sample.
• Define a confidence interval and its level.
• Derive some confidence intervals in 1 sample problems: means and proportions.
• Discuss the difference between Fisher and Neyman-Pearson.
• Describe the ingredients of Neyman-Pearson hypothesis testing.

Purposes Continued
• Define null and alternative hypotheses.
• Define a test statistic, rejection region, level.
• Define Type I and Type II errors.
• Differentiate between one-tailed and two-tailed problems.
• Give specific formulas for hypotheses about means and proportions.
• Define a P-value.
• Understand the technical meaning of statistically significant.

List of Statistical Problems
• Name the most likely value of parameters: point estimation.
• Name a range of likely values: confidence interval.
• Assess evidence against a hypothesis about parameters: hypothesis testing.
• Make forecasts, do interpolation.
• And more.

Point Estimation
• Estimate: a number which is our best guess for the parameter value.
• Estimator: a rule for computing the estimate from data.
• An estimator is a random variable which is a function of the data.
• Example: Newcomb & Michelson measured the speed of light in the 1880s.
• Made 66 measurements of the time taken by light to travel 7.44373 km.
• Measured values are X1, X2, . . . , Xn with n = 66.
• Use lower case letters for observed values.
• First measurement was 24.828 millionths of a second.
• Convert each measurement to a speed of light:
  x1 = 10⁹ · 7.44373/24.828 = 2.998119 × 10⁸ m/s.
  x2 = 2.998361 × 10⁸ m/s.
• Point estimate of the speed of light is 2.998336 × 10⁸ m/s.

Estimators
• We were using the rule: average the data.
• So our estimator was
  X̄ = (X1 + · · · + Xn)/n.
• Model for measurement error has several parts: X1, . . . , Xn independent and identically distributed.
• Let µ = E(Xi) be the population mean: the long run average measurement.
• Population SD is σ.
• Speed of light is c — standard notation.
• Relate µ to c: µ = c + bias.
• Often assume the bias is 0.

Newcomb data
[figure of the Newcomb measurements]

Point Estimation
• Have data and a model for the population.
• Model describes the population in terms of some parameters.
• Binomial(n, α) model: α is a parameter.
• Sample from a N(µ, σ²) model: parameters are µ and σ.
• Sample from the Gamma density
  f(x; α, β) = (1/(βΓ(α))) (x/β)^(α−1) exp(−x/β),  x > 0.
• Parameters are α and β.
• Generic notation: θ.

Standard Errors
• Estimates should always be accompanied by some assessment of their likely accuracy.
• For unbiased estimators with approximately normal sampling distributions we use the Standard Error.
• The SE of an estimator θ̂ of θ is
  SE = √Var(θ̂).
• That is: the Standard Error of an estimator is another name for its SD.
• The standard error of α̂ in the Binomial(n, α) problem is √(α(1 − α))/√n.
• The SE of X̄ is σ/√n.

Estimated Standard Errors
• What accompanies our point estimate is a number, not a formula.
• The SE is usually a formula with unknown parameters in it.
• We estimate the SE by plugging in estimates of the parameters.
• The SE for α̂ is √(α(1 − α)/n), so the Estimated SE is √(α̂(1 − α̂))/√n.
• You plug in data to get a number to put in your report.
• We use Standard Errors in Confidence Intervals.

Confidence Interval Definition
• A level β confidence interval for a parameter θ is the interval [L, U] between two statistics L and U such that
  P(L ≤ θ ≤ U) ≥ β
  for all possible parameter values.
• We prefer to replace ≥ by = or ≈.
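• To make the coverage statement concrete, here is a minimal simulation sketch in Python (the population values µ = 10, σ = 2 and sample size n = 25 are made up for illustration), using the interval X̄ ± 1.96σ/√n that is derived in the next section. The fraction of simulated intervals containing the true µ should be close to 0.95.

    import numpy as np

    rng = np.random.default_rng(1)

    # Made-up population for illustration: normal, mean mu = 10, known SD sigma = 2.
    mu, sigma, n, n_reps = 10.0, 2.0, 25, 10_000

    covered = 0
    for _ in range(n_reps):
        x = rng.normal(mu, sigma, size=n)
        xbar = x.mean()
        L = xbar - 1.96 * sigma / np.sqrt(n)   # lower confidence limit
        U = xbar + 1.96 * sigma / np.sqrt(n)   # upper confidence limit
        covered += (L <= mu <= U)              # does this interval contain the truth?

    print(covered / n_reps)   # close to 0.95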
• We use CIs by:
  – Deciding how to do the data analysis before gathering data (decide on the formulas for L and U before getting data).
  – Getting data; computing the observed values of L and U, say l and u.
  – Saying 'I am 100β% confident that θ is in the interval [l, u]'.
• 1 − β is the error rate or non-coverage rate.

Populations and Samples
• Meaning of a sample from a population.
• Population is the group we want to find out about.
• Can be real: all Canadian adults of working age.
• Can be 'conceptual': all possible outcomes of some experiment.
• Populations are often thought of as populations of numbers.
• Conceptual populations are often described by a probability density or pmf.
• Example: heights of adults. Think of the population as being normally distributed with mean µ and sd σ.
• Example: repeatedly measure the speed of light in a vacuum. Each measurement is 'truth' plus 'measurement error'. Population of errors described by a density: N(0, σ²) perhaps.

Populations and Samples
• Sample is the part of the group for which data is obtained.
• Use n for the number of items sampled.
• Call it a "single sample" problem if we measure 1 number for each item sampled.
• Call the measurements X1, . . . , Xn.
• Random sampling: a fixed number of members of the group selected by a random mechanism playing no favourites.
• With replacement: pick one at a time. On the ith selection each member of the population has the same chance of being drawn, even if that member has been picked before.
• Usual model for conceptual populations.
• Without replacement: pick one at a time. On the ith selection each member of the population who has not been drawn yet has the same chance of being drawn.
• Common model (sampling method) for real populations.
• Neither is the usual selection method in real surveys.

Simplest Derivation of a Confidence Interval
• Mathematical model for a single sample: X1, . . . , Xn are independent and identically distributed. Write 'iid'.
• Simplest populations to describe – approximately normal, like heights.
• Suppose X1, . . . , Xn are independent N(µ, σ²).
• Suppose (quite unrealistically) that σ is known.
• I now show you a 95% confidence interval for µ, based on the data.
• Consider the random variable
  Z = (X̄ − µ)/(σ/√n).
• Then, regardless of what µ is, Z has a standard normal distribution.
• So P(a ≤ Z ≤ b) does not depend on µ.
• No matter what µ is,
  P(−1.96 ≤ Z ≤ 1.96) = ∫_{−1.96}^{1.96} φ(z) dz = 0.95.

The Confidence Interval
• The event −1.96 ≤ Z ≤ 1.96 can be rewritten in a number of ways.
• It is the event
  −1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96.
• Multiply through by σ/√n (which is positive):
  −1.96 σ/√n ≤ X̄ − µ ≤ 1.96 σ/√n.
  Notice this is still the same event.
• Rearrange the second inequality:
  L ≡ X̄ − 1.96 σ/√n ≤ µ.
• Rearrange the first inequality:
  µ ≤ X̄ + 1.96 σ/√n ≡ U.
• So no matter what µ is:
  P(L ≤ µ ≤ U) = 0.95.
• The range L to U is a 95% confidence interval.

An example with data
• Simon Newcomb made 66 measurements of the time taken by light to travel 7.44373 km.
• I round off a bit from the real data.
• Convert to a list of 66 speeds.
• Sample mean is 299,833,553 m/s.
• Temporarily assume σ = 130,000 m/s is known.
• The 95% confidence interval is
  299,833,553 − 1.96 × 130,000/√66 to 299,833,553 + 1.96 × 130,000/√66 m/s.
• We say we are 95% confident that the speed of light is between 299,802,189 and 299,864,917 m/s.

Caveats and improvements
• More digits than is wise, but 6 leading digits are worth reporting.
• The quantity 130,000/√66 m/s is called the standard error of the sample mean.
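• A quick numerical check of the interval and standard error just quoted (a Python sketch using the same rounded values):

    import math

    xbar = 299_833_553.0   # sample mean speed, m/s
    sigma = 130_000.0      # (pretended known) population SD, m/s
    n = 66

    se = sigma / math.sqrt(n)    # standard error of the sample mean, about 16,000 m/s
    lo = xbar - 1.96 * se        # lower 95% confidence limit
    hi = xbar + 1.96 * se        # upper 95% confidence limit
    print(round(se), round(lo), round(hi))   # reproduces 299,802,189 and 299,864,917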
• Pretty well everything is an approximation, so many data analysts round 1.96 to 2.
• We are only pretending we know σ.
• Usually we have to use the data to tell us about σ as well as about µ.
• Notation: define the upper α critical point of the normal by
  P(N(0, 1) > zα) = α.
• So z0.025 = 1.96.

The role of normality
• We assumed initially that the population we are sampling is itself normally distributed.
• But our basic probability statement was:
  P(−1.96 ≤ (X̄ − µ)/(σ/√n) ≤ 1.96) = ∫_{−1.96}^{1.96} φ(z) dz = 0.95.
• Accuracy depends on the sampling distribution of X̄.
• The central limit theorem says: if n is large enough this is normal for (nearly) any population distribution.
• More skewness means a larger n is needed.
• Heavy tails mean a larger n is needed.
• We often use the rule of thumb: n ≥ 30.
• Message: use the same formula if n is large:
  X̄ ± zα/2 σ/√n.

Unknown SD, lots of data
• Actually Newcomb did not know σ at all.
• He measured s, the SD of his 66 measurements.
• In fact s = 130,026 m/s.
• When n is large s will be close to σ, so
  (X̄ − µ)/(σ/√n) ≈ (X̄ − µ)/(s/√n).
• So just replace σ by s in the confidence interval.
• We are 90% confident that the speed of light is in the range
  299,833,553 − 1.645 × 130,026/√66 to 299,833,553 + 1.645 × 130,026/√66 m/s.
• The Estimated Standard Error is 130,026/√66.
• It estimates the Standard Deviation of X̄.
• Notice the use of z0.05 = 1.645, not z0.025 = 1.96.

Small samples – Student's t distribution
• How good is the approximation?
• We estimated the SD of X̄ using the same data from which we computed the mean.
• So we should use something a bit bigger than 1.645.
• For 66 observations that 'bit bigger' is 1.669.
• The correct critical point comes from Student's t distribution.

More probability – small samples
• When sampling from a normally distributed population we have:
  P((X̄ − µ)/(S/√n) ≤ x) = ∫_{−∞}^{x} fT,n−1(u) du
  where fT,n−1 is the Student's t-density "with n − 1 degrees of freedom".
• To be precise – but this density is not part of this course:
  fT,ν(u) = [Γ((ν + 1)/2) / (√(πν) Γ(ν/2))] (1 + u²/ν)^(−(ν+1)/2).
• As ν → ∞ this converges to the standard normal density.
• The curve looks a lot like the normal but with heavier tails.

Specific scientific settings
• Specific settings have specific formulas for the Estimated SE.
• Scenario 1: sample from a normal population, σ (population SD) known, CI for the population mean µ.
• Interval (already done) is
  X̄ ± zα/2 σ/√n.
• Scenario 2: sample from a general population, σ (population SD) unknown, sample size n large, CI for the population mean µ.
• Interval (already done) is
  X̄ ± zα/2 s/√n.
• Scenario 3: sample from a normal population, σ (population SD) unknown, sample size n anything, CI for the population mean µ.
• Interval is
  X̄ ± tα/2,n−1 s/√n.
• Multipliers tα/2,n−1 come from the other table at the back of the text.
• Statistical packages always do Scenario 3 arithmetic.

Confidence intervals for proportions
• Common scientific framework:
• Sequence of Bernoulli trials.
• Number of trials n fixed, p is the "Success Probability" on each trial.
• X is the number of successes.
• Goal: a confidence interval for the proportion.
• Based on the Central Limit Theorem.

Using the CLT
• Recall p̂ = X/n and X = X1 + · · · + Xn; each Xi is Bernoulli.
• So p̂ is a sample mean of the Xi.
• Population mean is µ = E(Xi) = p.
• Population variance is σ² = Var(Xi) = p(1 − p).
• So the SE of p̂ is σ/√n = √(p(1 − p))/√n.
• The Estimated SE is usually taken to be √(p̂(1 − p̂))/√n.
• The CLT says
  (p̂ − p)/√(p(1 − p)/n) ⇒ N(0, 1).
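• A quick simulation sketch of this normal approximation (the values n = 200 and p = 0.3 are made up for illustration, not taken from the notes): the standardized p̂ should fall in (−1.96, 1.96) about 95% of the time.

    import numpy as np

    rng = np.random.default_rng(2)

    n, p, n_reps = 200, 0.3, 10_000             # illustrative values

    x = rng.binomial(n, p, size=n_reps)         # number of successes in each replication
    p_hat = x / n
    z = (p_hat - p) / np.sqrt(p * (1 - p) / n)  # standardized p-hat from the CLT statement

    print(np.mean(np.abs(z) <= 1.96))           # close to 0.95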
Using the CLT 2
• The law of large numbers says:
  lim_{n→∞} p̂ = p.
• So it is also true that
  (p̂ − p)/√(p̂(1 − p̂)/n) ⇒ N(0, 1).
• The result is:
  lim_{n→∞} P(−zα/2 ≤ (p̂ − p)/√(p̂(1 − p̂)/n) ≤ zα/2) = ∫_{−zα/2}^{zα/2} φ(z) dz = 1 − α.
• This leads to an approximate level 1 − α confidence interval.

Solving inequalities to get limits
• Temporary notation: A is the event
  −zα/2 ≤ (p̂ − p)/√(p̂(1 − p̂)/n) ≤ zα/2.
• Solve the inequalities in A to isolate p: multiply through by the SE, so A is the event
  −zα/2 √(p̂(1 − p̂)/n) ≤ p̂ − p ≤ zα/2 √(p̂(1 − p̂)/n).
• Rearrange each inequality: the right-hand one gives
  p̂ − zα/2 √(p̂(1 − p̂)/n) ≤ p.
• Similarly, the left-hand one gives
  p ≤ p̂ + zα/2 √(p̂(1 − p̂)/n).

General points
• Most essential: the meaning of confidence.
• If we analyze 100 data sets and compute 100 (exact) confidence intervals at the 95% level, we expect that some of the 100 intervals will contain the truth and some won't.
• The expected number which contain the truth is 95.
• The number which contain the truth is random.
• Rule of thumb: if np > 10 and n(1 − p) > 10 then the normal approximation is fine.
• You don't know p, but you use p̂ in the rule of thumb.
• The text uses 5 instead of 10. That is ok, too.

A catalogue of confidence intervals
• Intervals for population proportions; done earlier.
• Intervals for population means:
  – Samples from Normal populations with known σ.
  – Samples from Normal populations with unknown σ.
  – Large samples from more general populations.

Confidence statements, normal populations
• Normal sample, σ known:
  P(−z ≤ (X̄ − µ)/(σ/√n) ≤ z) = Φ(z) − Φ(−z)
  so if we find z so that Φ(z) − Φ(−z) = 1 − α then
  X̄ − zσ/√n to X̄ + zσ/√n
  is an exact level 1 − α confidence interval for µ.
• The value of z is denoted zα/2 because
  P(N(0, 1) > z) = α/2 = P(N(0, 1) < −z)
  in this case. We call zγ the upper tail γ critical point.

Confidence statements, normal populations
• Normal sample, σ unknown:
  P(−t ≤ (X̄ − µ)/(S/√n) ≤ t) = ∫_{−t}^{t} fT,n−1(u) du
  so if we find t so that ∫_{−t}^{t} fT,n−1(u) du = 1 − α then
  X̄ − tS/√n to X̄ + tS/√n
  is an exact level 1 − α confidence interval for µ.
• The value of t is denoted tα/2,n−1 because
  P(T > t) = α/2 = P(T < −t).
  Again tγ,ν is notation for the upper γ critical point of a Student's t-distribution on ν degrees of freedom (here ν = n − 1).

Confidence statements, large samples, general populations
• Sample from a population with mean µ and unknown SD σ:
  P(−t ≤ (X̄ − µ)/(S/√n) ≤ t) ≈ ∫_{−t}^{t} fT,n−1(u) du ≈ Φ(t) − Φ(−t)
  so
  X̄ − tα/2,n−1 S/√n to X̄ + tα/2,n−1 S/√n
  and
  X̄ − zα/2 S/√n to X̄ + zα/2 S/√n
  are both approximate large sample level 1 − α confidence intervals for µ.
• Very rarely: σ is known, so replace S by σ and use zα/2.

Confidence statements, large samples, general populations
• Books traditionally recommend z for n ≥ 30 or n ≥ 40 or some such rule of thumb.
• BUT I say just use t; software always does, and the t approximation is generally better.
• The rule of thumb comes from the DARK AGES before computers, when people used the tables in the book.
• Those are for statistics exams, nothing else.

Typical hypothesis testing science questions
• New drug for blood pressure. Get 200 patients. Pick 100 at random to get the new drug; the others get the old.
• Choose between two possibilities: the drug reduces BP or it doesn't.
• The speed of light in vacuum is known. Measure the speed of neutrinos. Is their speed equal to the speed of light or not?
• Are far away galaxies moving away from earth faster than nearby ones or not?
• Is the speed of light the same in north-south and east-west directions?
• Does some intervention program in prison reduce recidivism or not?
• Common feature: choose between two scientific alternatives.

Methodology
• Conduct an experiment in which a response (BP, speed of neutrinos, two light speeds, recidivism) is measured.
• Formulate statistical models: data are like a sample from a normal population; the number of patients surviving has a binomial distribution; north-south speeds and east-west speeds are like samples from 2 populations.
• Phrase the scientific alternatives as alternatives about the parameter values in the model: mean north-south speed equals mean east-west speed OR not; probability of re-offense in the treatment group equals probability of re-offense in the control group OR not ...
• Develop a rule to make a choice between the two alternatives.
• Understand the error rates.
• Apply the rule to the data.
• Details follow.

Example 1: Measurement bias
• Newcomb makes n = 66 measurements of the time for light to travel 7.44373 km.
• The modern value for that time is 24.82961 microseconds.
• Is Newcomb biased?
• Model: each measurement is like a draw from a population of possible measurements. Data are X1, . . . , Xn, a sample from a population with mean µ and SD σ.
• No bias translates to µ = 24.82961 microseconds.
• We say our null hypothesis, H0, is µ = 24.82961.
• Our alternative hypothesis, Ha, becomes µ ≠ 24.82961.
• H0 is pronounced "H nought" ("H not").

The test statistic
• To make the decision we find a test statistic, T, which is a function of the data.
• It will depend on the number 24.82961 as well.
• It should tend to be big if the alternative hypothesis is right.
• It should NOT tend to be big if the null hypothesis is right.
• We will calculate T and choose the alternative if it is "too big".
• First obvious suggestion: T = |X̄ − 24.82961|.
• How big is too big? Compare T to the variability of X̄ − 24.82961.
• Estimate that variability using the Estimated Standard Error s/√n of X̄.
• So change to
  T = |X̄ − µ0| / (s/√n),
  where µ0 = 24.82961 is the null value.

How big is too big?
• Two big approaches – assess evidence versus make a firm decision.
• Fisher: summarize the size of T by a P-value and interpret this P-value as the strength of evidence against the null hypothesis.
• Formal decision making: select a rejection region. If T lands in the rejection region we reject the null hypothesis and behave as if the alternative hypothesis is true.
• The two approaches are very closely connected.
• Neyman-Pearson approach first — formal decision making.
• Recognize two kinds of errors.
• Type I error: Newcomb has no bias but we say he did. The null hypothesis is true but we say it is false.
• Type II error: Newcomb was biased but we miss that fact. The null hypothesis is false but we decide it is true.
• Language used in the book: reject the null hypothesis or fail to reject the null hypothesis.
• Other places: "fail to reject" the null hypothesis is called "accept the null hypothesis". You behave as if the null is true.

Making a decision
• For Newcomb our rejection region is
  T = |X̄ − µ0| / (s/√n) > c.
• c is the critical point.
• How do we select c?
• The Neyman-Pearson method:
• Choose c to control the Type I error rate.
• Select a pre-specified tolerable error rate: usually 5%. Call this rate α.
• Find c so that PH0(T > c) = α.
• PH0 is notation to show that we compute this chance assuming that the null hypothesis is true.

Specific scientific settings
• Scenario 1: sample from a normal population, σ (population SD) known, hypothesis tests for the population mean µ.
• Two-sided alternative: H0: µ = µ0, Ha: µ ≠ µ0, with
  T = |X̄ − µ0| / (σ/√n) and c = zα/2.
• One-sided alternative: H0: µ = µ0, Ha: µ > µ0, or H0: µ ≤ µ0, Ha: µ > µ0, with
  T = (X̄ − µ0) / (σ/√n) and c = zα.
• I expect you to know what to do if the inequalities are reversed.

Scenario 2, σ unknown
• Scenario 2: sample from a general population, σ (population SD) unknown, sample size n large, hypothesis tests for the population mean µ.
• Two-sided alternative: H0: µ = µ0, Ha: µ ≠ µ0, with
  T = |X̄ − µ0| / (s/√n) and c = tα/2,n−1.
• One-sided alternative: H0: µ = µ0, Ha: µ > µ0, or H0: µ ≤ µ0, Ha: µ > µ0, with
  T = (X̄ − µ0) / (s/√n) and c = tα,n−1.

Small samples
• Scenario 3: sample from a normal population, σ (population SD) unknown, sample size n anything, hypothesis tests for the population mean µ.
• Use the same method as Scenario 2.
• But now the method is exact.
• Without the normal population assumption we are relying on the CLT and LLN and Slutsky's theorem.

Hypothesis tests for proportions
• Common scientific framework:
• Sequence of Bernoulli trials.
• Number of trials n fixed, p is the "Success Probability" on each trial.
• X is the number of successes.
• Goal is a hypothesis test for proportions.
• Method based on an application of the Central Limit Theorem.
• Same list of null / alternative choices: H0: p = p0 or H0: p ≤ p0.
• H0: p = p0 allows either 1- or 2-sided alternatives.

Using the CLT (repeat from CI notes!)
• Recall p̂ = X/n and X = X1 + · · · + Xn; each Xi is Bernoulli.
• So p̂ is a sample mean of the Xi.
• Population mean is µ = E(Xi) = p.
• Population variance is σ² = Var(Xi) = p(1 − p).
• So the SE of p̂ is σ/√n = √(p(1 − p))/√n.
• The CLT says: if p = p0 then
  (p̂ − p0)/√(p0(1 − p0)/n) ⇒ N(0, 1).

Using the CLT 2
• Our test statistic is either
  T = (p̂ − p0)/√(p0(1 − p0)/n) for Ha: p > p0
  or
  T = |p̂ − p0|/√(p0(1 − p0)/n) for Ha: p ≠ p0.
• The critical value c is zα/2 for the two-sided alternative or zα for the one-sided alternative.

Some scientific examples
• Cadmium in a lake example.
• n = 17 measurements of cadmium concentration: x̄ = 211, s = 15, units are parts per million or some such. (Important, but these numbers are made up.)
• Scientific question: decide between two possibilities – concentration below 200 vs above 200.
• Typical one-sided situation.
• Need to connect the data to the scientific question of interest.
• Introduce notation: X1, . . . , Xn are the 17 measurements.
• Must assume that they are gathered and measured in such a way that they are a sample of size 17 from a population whose mean µ is "the concentration of cadmium in the lake".
• The definition of that last quantity is a scientific problem.
• Issues to consider: is the whole lake sampled? are the measurements biased? are the measurement errors independent?
• Assume these issues have been dealt with.

Cadmium
• For a first pass I consider BOTH possible H0s.
• For H0: µ ≤ 200 use
  T = (X̄ − 200)/(s/√n)
  and reject if T > t0.05,n−1 = 1.75. (Notice the rejection region.)
• Notice the use of the borderline value, 200, in T.
• Plug in the values and find
  T = (211 − 200)/(15/√17) = 3.02.
• Since 3.02 > 1.75 we reject the hypothesis that µ ≤ 200.

P-values
• BUT: in fact we can say a bit more.
• 3.02 is quite a bit bigger than 1.75.
• If we had used α = 0.01 instead of 0.05, our rejection region would be T > t0.01,16 = 2.58 and we would still have rejected.
• In fact we would reject for any α for which tα,16 < 3.02.
• The smallest possible α is when tα,16 = 3.02, or
  P(T16 ≤ 3.02) = 1 − α = 1 − P(T16 ≥ 3.02).
• This α is Fisher's P-value.
• Compute P by finding the area to the right of the observed statistic under the null density of the statistic.

P-values
• Reject H0 at level α if P < α.
• If H0 is right then P has a Uniform[0, 1] distribution.
• Interpret P as a measure of evidence strength – smaller P, stronger evidence against H0.
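• Here is a minimal sketch of that P-value calculation for the cadmium example (Python with scipy; the numbers are the made-up ones from above):

    import math
    from scipy import stats

    xbar, s, n, mu0 = 211.0, 15.0, 17, 200.0       # made-up cadmium numbers from the example

    t_stat = (xbar - mu0) / (s / math.sqrt(n))     # observed statistic, about 3.02
    crit = stats.t.ppf(0.95, df=n - 1)             # upper 0.05 critical point of t(16), about 1.75
    p_value = stats.t.sf(t_stat, df=n - 1)         # one-sided P: area to the right under t(16)

    print(round(t_stat, 2), round(crit, 2), round(p_value, 4))   # P comes out a bit under 0.005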
• Call the evidence statistically significant if P < 0.05.
• Highly statistically significant and very highly statistically significant are often used for smaller thresholds like 0.01 or 0.001.
• Some statistics packages label P-values with 1 star for P < 0.05, 2 stars for P < 0.01 and 3 stars for P < 0.001.
• These are all simply conventions.
• For two-tailed problems: P is twice the area in the smaller tail.

Example from Devore, Page 342 Q 65
• Sample of n = 50 lens thicknesses. Given x̄ = 3.05 and s = 0.34 (all in mm).
• Desired mean thickness is 3.20 mm.
• Do "the data strongly suggest that the true average thickness of such lenses is something other than what is desired"?
• Clearly a two-sided alternative. The null must be H0: µ = 3.20.
• Test statistic is
  T = |3.05 − 3.2| / (0.34/√50) = 3.12.
• P-value? Twice the area to the right of 3.12 under the t density on 49 df.
• P = 0.003, which is very significant. (Table A.8 gives P in the range 0.002 to 0.004.)
• So we see very strong evidence against the assertion that the true average thickness is 3.2 mm.
• We would reject the null at α = 0.05 or even α = 0.01.

Error rates and sample size calculations
• Type I error: incorrectly reject H0.
• Type II error: incorrectly fail to reject H0.
• Type I error rate is α; determined in advance.
• Type II error rate is β – it depends on what the true parameter value is.
• Can sometimes compute β = P(don't reject) as a function of the true parameter.
• The answer will depend on n.
• Can then sometimes choose n to make β suitably small.
• But often n depends on unknown parameters like σ.
• So we design for some hoped-for value of σ.

Sample size, Z test, 1 sided
• Imagine testing µ ≤ µ0 against µ > µ0.
• Assume that σ is known.
• Fix some α like 0.05.
• So reject if
  Z = (X̄ − µ0)/(σ/√n) > zα.
• Compute β:
  β = P((X̄ − µ0)/(σ/√n) < zα).
• For a true µ > µ0 we make a Type II error when Z < zα.
• Centre on the correct µ:
  β = P((X̄ − µ)/(σ/√n) + (µ − µ0)/(σ/√n) < zα).
• This is the area to the left of zα − (µ − µ0)/(σ/√n):
  β = Φ(zα − (µ − µ0)/(σ/√n)).
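• A minimal sketch of how this formula gets used for sample size planning (the design values µ0, µ, σ and the target error rates below are made up for illustration): compute β at the hoped-for alternative and search for the smallest n that makes it acceptably small.

    import math
    from scipy.stats import norm

    # Illustrative design values (not from the notes): test H0: mu <= mu0 vs Ha: mu > mu0.
    mu0, mu, sigma, alpha = 200.0, 205.0, 15.0, 0.05

    z_alpha = norm.ppf(1 - alpha)   # upper alpha critical point, about 1.645

    def type2_rate(n):
        # beta = Phi(z_alpha - (mu - mu0)/(sigma/sqrt(n))), the formula derived above
        return norm.cdf(z_alpha - (mu - mu0) / (sigma / math.sqrt(n)))

    # Smallest n with Type II error rate at most 0.10 (power at least 90%) at mu = 205.
    n = 2
    while type2_rate(n) > 0.10:
        n += 1
    print(n, round(type2_rate(n), 3))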