An Introduction to Estimation (Estimation I)

Biostatistics serves two purposes (among others): to use information from a sample of data to
1. Describe our best guess of the characteristics of the population.
   • Best guess: estimation
2. Gauge the plausibility of alternative explanations for what is observed in the sample.
   • Hypothesis testing

Estimates vs Parameters

Sample Statistics
Mean: $\bar{X}$
Variance: $S^2$
Standard Deviation: $S$

Population Parameters
Mean: $\mu$
Variance: $\sigma^2$
Standard Deviation: $\sigma$

What does it mean to say we know $\bar{X}$ from a sample, but we don't know $\mu$?
• We can observe a sample mean and then use it as an estimate of the true, unknown population mean.

Some Notation / Definitions
1. Estimation: The computation of a statistic from sample data for the purpose of obtaining a guess of the unknown population parameter value.
2. Estimator: The label given to the statistic that is to be calculated in estimation, e.g., the sample mean $\bar{X}$ or the sample standard deviation $S$.
3. Estimate: The value the estimator takes when calculated using an actual sample of data, e.g., $\bar{x} = 10$ minutes, $s = 3$ minutes.

What criteria should we use to define a good estimate?
1) In the long run, “correct”: if we imagine sampling over and over, the average of the estimates from repeated sampling should equal the correct answer. This is UNBIASEDNESS.
2) In the short run, “in error by as little as possible”: most of the time, the estimate should be “close” to the true value. This is the concept of precision. It is also called the statistical concept of minimum variance.

Example: Is the sample mean or the sample median a better choice as an estimate of $\mu$, the true population mean, for the normal distribution?
1. Unbiasedness: Both the sample mean and the sample median are unbiased estimates of $\mu$. (Note this is true for the Normal distribution, but does not hold for all distributions.)
2.
Precision: For the sampling distributions of sample means and sample medians, it can be shown that
Variance(sample means) < Variance(sample medians)
For X ~ Normal, the sample mean is said to be a minimum variance unbiased estimator (MVUE) of $\mu$.

If the data are normally distributed:
$$X \sim N(\mu, \sigma^2) \quad\Rightarrow\quad \bar{X} \sim N(\mu, \sigma^2/n)$$
That is, we know that the sampling distribution of sample means from this population will follow
• a Normal distribution with
• the same mean as the underlying population
• a decreased variance relative to the underlying population: $\sigma^2/n$

We can then make statements about the probability of observing a value of $\bar{X}$ within some interval around $\mu$:
• $\bar{X}$ is within 1 standard error of $\mu$ 68% of the time.
• $\bar{X}$ is within 1.96 standard errors of $\mu$ 95% of the time.
• $\bar{X}$ is within 2.576 standard errors of $\mu$ 99% of the time.
[Figure: sampling distribution of $\bar{X}$ centered at $\mu$; 68% of the area lies within $\mu \pm \sigma/\sqrt{n}$ and 95% within $\mu \pm 1.96\,\sigma/\sqrt{n}$.]

2 types of Estimators
Point Estimators: single best guess. Form of estimate: a value, e.g., $\bar{x} = 10$ ml
Interval Estimators: range of values. Form of estimate: (lower limit, upper limit), e.g., (5, 15) ml
We've been working with point estimators; our next step is to define interval estimators.

There are 3 ingredients to a confidence interval:
1. a point estimator (e.g., $\bar{x}$)
2. the SE of the point estimator (e.g., $\sigma/\sqrt{n}$)
3. a confidence coefficient with an associated probability (e.g., a percentile of a Normal distribution)

The form of the confidence interval is then:
Lower limit = (Point Estimator) - [(conf. coeff)(SE)]
Upper limit = (Point Estimator) + [(conf. coeff)(SE)]
For example, for a mean:
$$LL = \bar{X} - c\,(\sigma/\sqrt{n}), \qquad UL = \bar{X} + c\,(\sigma/\sqrt{n})$$
Next step: where does ‘c’ come from?

Interpretation of a 95% Confidence Interval
[Figure: the population mean $\mu$ with the interval estimates from many repeated samples drawn as parentheses around each sample's $\bar{x}$.]
In repeated sampling, each sample gives rise to a point and interval estimate of the mean.
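The repeated-sampling idea can be illustrated with a small simulation. The sketch below is only an illustration, not part of the original slides: the population values (mean 10, standard deviation 3), the sample size of 25, the number of repetitions, and the function name `coverage` are all assumptions chosen for the example. It uses the multiplier 1.96 quoted above for "within 1.96 standard errors 95% of the time" and counts how often the interval around each simulated sample mean contains the true mean.

```python
import random
from statistics import fmean


def coverage(mu, sigma, n, n_samples=2000, c=1.96, seed=1):
    """Fraction of simulated intervals xbar +/- c*sigma/sqrt(n) that contain mu."""
    rng = random.Random(seed)      # fixed seed so the run is repeatable
    se = sigma / n ** 0.5          # standard error of the sample mean (sigma known)
    hits = 0
    for _ in range(n_samples):
        # Draw one sample of size n from N(mu, sigma^2) and compute its mean.
        xbar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # Does this sample's interval cover the true mean?
        if xbar - c * se <= mu <= xbar + c * se:
            hits += 1
    return hits / n_samples


# Hypothetical population (an assumption for illustration): mu = 10, sigma = 3, n = 25.
print(coverage(mu=10, sigma=3, n=25))
```

The printed proportion should be close to 0.95, though it will not equal it exactly because only a finite number of samples is simulated.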
Each sample gives rise to its own
• point estimate and
• confidence interval estimate built around the point estimate.
We will construct our intervals so that:
• if all possible samples of a given sample size were drawn from the underlying distribution and
• each sample gave rise to its own interval estimate,
• then 95% of all such intervals would include the unknown $\mu$, while 5% would not.

Interpreting Confidence Intervals
Example: Take a sample, compute $\bar{x}$, and compute an interval estimate for the mean: (1.3, 9.5)
Correct: “95% of intervals constructed in this manner will include the population mean $\mu$.”
Incorrect: “The probability that the interval (1.3, 9.5) contains $\mu$ is 0.95.”
The second statement is incorrect because once an interval is constructed it either contains the mean or it doesn't. Since we don't know $\mu$, we just don't know which it is.

Computing a Confidence Interval for $\mu$
Case 1: $\sigma^2$ known
Example: The weight in micrograms of a drug inside capsules is Normally distributed, with $\sigma^2 = 0.25$. We are given a sample of n = 30 capsules with a mean weight of 0.51 micrograms, and asked to construct a 95% confidence interval estimate of the population mean weight.
1. The point estimate is $\bar{x} = 0.51$ micrograms.
2. The standard error of $\bar{x}$ is:
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}} = \frac{\sqrt{0.25}}{\sqrt{30}} = 0.091 \text{ micrograms}$$
3. We set the confidence coefficient equal to $z_{0.975} = 1.96$ for a 95% confidence interval. We can then compute the lower and upper limits as:
$$LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 0.51 - (1.96)(0.091) = 0.332$$
$$UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 0.51 + (1.96)(0.091) = 0.688$$
The 95% confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.

Deriving the expression for a Confidence Interval
We want: $\Pr(z_l \le Z \le z_u) = 1 - \alpha = 0.95$
$z_l$ = 2.5th percentile of the Normal
$z_u$ = 97.5th percentile of the Normal
[Figure: standard Normal curve centered at 0; the central area between $z_l$ and $z_u$ is $(1-\alpha) = .95$, and each shaded tail has area $\alpha/2 = .025$.]
We split the excluded area ($\alpha = .05$) symmetrically around the mean.
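As a cross-check of the capsule example and of the percentiles $z_l$ and $z_u$, the calculation can be sketched in Python using only the standard library. The function name `z_interval` and its parameter names are assumptions made for this illustration; `statistics.NormalDist().inv_cdf` returns a percentile of the standard Normal distribution.

```python
from statistics import NormalDist


def z_interval(xbar, sigma, n, level=0.95):
    """Confidence interval for a mean when the population SD sigma is known."""
    alpha = 1 - level
    z_u = NormalDist().inv_cdf(1 - alpha / 2)   # upper percentile, e.g. 97.5th
    z_l = NormalDist().inv_cdf(alpha / 2)       # lower percentile, equals -z_u
    se = sigma / n ** 0.5                       # standard error of the mean
    return xbar + z_l * se, xbar + z_u * se


# Capsule example: sigma^2 = 0.25 (so sigma = 0.5), n = 30, sample mean 0.51.
lo, hi = z_interval(xbar=0.51, sigma=0.25 ** 0.5, n=30, level=0.95)
print(round(lo, 2), round(hi, 2))
```

Because the code keeps full precision for the standard error and for the percentile, the third decimal place of each limit can differ slightly from the hand calculation, which rounds the standard error to 0.091 first.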
Recall that we want the Inverse Cumulative Distribution to find a percentile of the Normal.
In Minitab: Calc > Probability Distributions > Normal, with percentiles .025 and .975

We can find the percentiles for:
$\alpha/2 = .05/2 = .025$: $z_{.025} = -1.96$
and
$1 - \alpha/2 = 1 - (.05/2) = .975$: $z_{.975} = 1.96$
We now have: $\Pr(-1.96 \le Z \le 1.96) = .95$
[Figure: standard Normal curve centered at 0; the area between $z_{.025}$ and $z_{.975}$ is $(1-\alpha) = .95$, and each shaded tail has area $\alpha/2 = .025$.]

Derivation of a confidence interval on $\bar{X}$:
We know (simple standardization):
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
Substitute in for Z:
$$\Pr(z_l \le Z \le z_u) = \Pr\left(z_l \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le z_u\right)$$
Multiply by SE:
$$= \Pr\left(z_l \frac{\sigma}{\sqrt{n}} \le \bar{X} - \mu \le z_u \frac{\sigma}{\sqrt{n}}\right)$$
Subtract $\bar{X}$:
$$= \Pr\left(-\bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le -\mu \le -\bar{X} + z_u \frac{\sigma}{\sqrt{n}}\right)$$
Multiply by -1 (which reverses the inequalities) and use the symmetry $z_l = -z_u$:
$$= \Pr\left(\bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_u \frac{\sigma}{\sqrt{n}}\right)$$
This is of the form: (pt. estimate) ± (conf. coeff)(std error)

When $z_l = -1.96$ and $z_u = 1.96$,
$$\Pr\left(\bar{X} + z_l \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{X} + z_u \frac{\sigma}{\sqrt{n}}\right) = 0.95$$
We can then compute the lower and upper limits as:
$$LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 0.51 - (1.96)(0.091) = 0.332$$
$$UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 0.51 + (1.96)(0.091) = 0.688$$
The 95% confidence interval for the mean weight of drug per capsule is (0.33, 0.69) micrograms.

Bottom Line:
Confidence Interval Estimate = Point Estimate ± (Confidence Coefficient)(Std Error)
$$\bar{X} \pm (\text{percentile from } N(0,1)) \cdot \frac{\sigma}{\sqrt{n}}$$
Commonly used confidence coefficients from N(0,1):
• For a 90% confidence interval, $z_{.95} = 1.645$
• For a 95% confidence interval, $z_{.975} = 1.96$
• For a 99% confidence interval, $z_{.995} = 2.576$

Example
In the 1990 U.S. census, the mean height of men in the US was $\mu = 69$ inches, with $\sigma = 3$ inches. By the year 2004 heights may have changed, but we will assume the standard deviation is the same, and known: $\sigma = 3$ inches. Since we can't afford to measure the whole population, we'll take a sample of n = 100 men. We observe a mean height of $\bar{x} = 70$ inches.
a. What is the 95% confidence interval estimate of the year 2004 population mean height?
b. What is the 99% confidence interval estimate?
Solution
a. We want $(1-\alpha) = 0.95$ for a 95% confidence interval:
[Figure: standard Normal curve with central area .95 and upper-tail area .025 beyond $z_{.975} = 1.96$.]
$$LL = \bar{x} + z_l \frac{\sigma}{\sqrt{n}} = 70 - (1.96)\left(\frac{3}{\sqrt{100}}\right) = 69.4$$
$$UL = \bar{x} + z_u \frac{\sigma}{\sqrt{n}} = 70 + (1.96)\left(\frac{3}{\sqrt{100}}\right) = 70.6$$
The 95% C.I. estimate of the population mean height is (69.4, 70.6) inches.
Do we have evidence that heights of men have changed since 1990? That is, is the 1990 mean height within this interval?
Since 69 inches is outside of the 95% confidence interval, and most samples would result in a confidence interval that includes the population mean, it seems reasonable to conclude that heights of men in the U.S. may have changed. We suspect that the mean height is greater in 2004 than in 1990.

b. For desired confidence = 0.99:
[Figure: standard Normal curve with central area .99 and upper-tail area 0.005 beyond $z_{.995} = 2.576$.]
$$99\%\ CI = \bar{x} \pm z_{.995} \frac{\sigma}{\sqrt{n}} = 70 \pm (2.576)\left(\frac{3}{\sqrt{100}}\right)$$
The 99% confidence interval estimate is (69.2, 70.8) inches.
Since 69 inches is outside of the 99% confidence interval, it seems reasonable to conclude that heights of men in the U.S. might have changed. The average height in 2004 appears greater than in 1990.

Note that the 99% confidence interval is wider than the 95% CI.
• To have greater confidence that we know the mean, based upon the same sample, we have a wider interval of values for $\mu$.
[Figure: number line centered at $\bar{x} = 70$ showing the 95% CI (69.4, 70.6) nested inside the 99% CI (69.2, 70.8).]

Hint on the confidence coefficient:
For a $(1-\alpha)$ C.I., use the $1 - \alpha/2$ percentile of the N(0,1).
[Figure: standard Normal curve with central area $(1-\alpha)$ between $z_{\alpha/2}$ and $z_{1-(\alpha/2)}$ and area $\alpha/2$ in each tail.]
For example, for a 95% confidence interval, $(1-\alpha) = .95$.
Thus, $\alpha = .05$, so that $\alpha/2 = .025$ and $1 - (\alpha/2) = 1 - .025 = .975$.
Thus, we want the .975 or 97.5th percentile for a 95% confidence interval.

Another Example: A random sample of 25 women has a mean systolic blood pressure of 120 mmHg. Assuming the underlying distribution of SBP across women is Normal with $\sigma = 10$ mmHg, find the 99% confidence interval estimate of the unknown true mean, $\mu$.
Solution:
1. Point Estimate: $\bar{x} = 120$
2.
$\sigma = 10$ is known, so $\sigma_{\bar{x}} = \sigma/\sqrt{n} = 10/\sqrt{25} = 10/5 = 2$
3. To get the confidence coefficient: $(1-\alpha) = 0.99$, so $\alpha = .01$, $\alpha/2 = .005$, $1 - (\alpha/2) = .995$, and $z_{.995} = 2.576$
[Figure: standard Normal curve with central area 0.99 and area .005 in each tail; the upper cutoff is $z_{.995} = 2.576$.]
4. Confidence Interval: Pt. Est. ± (Conf. Coeff)(SE)
$$99\%\ CI = \bar{x} \pm z_{.995} \frac{\sigma}{\sqrt{n}} = 120 \pm (2.576)(2) = (114.8, 125.2)$$
With 99% confidence, the mean systolic blood pressure of the population of women that this sample represents is between 114.8 and 125.2 mmHg.

Recap: A confidence interval for a mean has the form:
Confidence Interval Estimate = Point Estimate ± (Confidence Coefficient)(Std Error)
$$\bar{X} \pm (\text{percentile from } N(0,1)) \cdot \frac{\sigma}{\sqrt{n}}$$
This holds when:
1. The data are normally distributed, with known variance $\sigma^2$, or
2. The data are not normally distributed, but the sample size n is large, so that the sampling distribution of the mean is approximately normal.

Up to this point we have been looking at
• estimation of an unknown population mean ($\mu$)
• using data from a sample ($\bar{x}$ and C.I.)
• assuming that we “know” the population variance, $\sigma^2$.
In reality, we typically have to
• estimate the population variance, using $s^2$
• along with estimating the mean
• from the same sample.
How does this affect confidence interval estimation for a mean?

1. We know how to calculate a confidence interval for the mean when $\sigma^2$ is known:
$$\bar{X} \pm z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$$
What do we do if $\sigma$ is UNKNOWN?
2. It seems like a reasonable idea to replace $\sigma$ with “s”.
Recall: s is the sample standard deviation
$$S = \sqrt{S^2}, \quad \text{where } S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$
Note that estimation of $\sigma$ depends upon our estimate of $\mu$.

3. The snag is that we can no longer use the multiplier z from the Normal distribution. In particular
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$$
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim\ ?$$
(not Normal)
When we replace the true (but unknown) value of the standard error with an estimate of the standard error:
• Instead of a Z-score, we now have
• a t-statistic:
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$

This random variable, t, is said to follow
• a Student's t-distribution
• with degrees of freedom = n-1
• IF the underlying data come from a Normal distribution!
That is: if $X \sim N(\mu, \sigma^2)$, then
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$

Features of the Student's t-Distribution
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} \sim t_{n-1}$$
[Figure: densities of N(0,1) and $t_{n-1}$, both centered at 0.]
The Student's t distribution is
• Bell-shaped
• Symmetric about zero
• Flatter than the Normal(0,1). This means
  • Variability is greater
  • More area under the tails, less at center
  • Resulting confidence intervals will be wider.

This greater variability or spread of the t-distribution should make intuitive sense:
• we are using an estimate of the standard error rather than the true value, so we have added uncertainty in our confidence interval.
Each degree of freedom (df) defines a separate t-distribution.
• The greater the df, the closer to the normal distribution (df = n-1).
• As n gets large, $t_{n-1} \to N(0,1)$.
[Figure: densities of t with df = 5, t with df = 25, and Normal(0,1).]

How to Use the Table of Percentiles of the Student's t-distribution
Table 5 (p. 757) in Rosner

d.f.    t.90     ...    t.995
1       3.078    ...    63.657
2       1.886    ...    9.925
...

Each row gives information for a separate t-distribution defined by df = n-1.
The column heading tells you which percentile is given in the body of the table.
The body of the table consists of the values of the percentiles.

[Figure: t-distribution with 1 df; the area to the left of 3.078 is 0.90.]
This number is the percentile in the body of the table.
• From the first row (df = 1), under the column t.90
• Read $t_{df=1,\,.90} = 3.078$
That is,
• $\Pr(t_{df=1} \le 3.078) = .90$

Using Minitab to Get Percentiles of the t-distribution
Calc > Probability Distributions > t…
Select Inverse Cumulative Prob to get a percentile
Enter df = n-1
Enter the desired percentile

Minitab output:
Inverse Cumulative Distribution Function
Student's t distribution with 1 DF
$P(t \le t_{1,0.90}) = 0.9000$, $t_{1,0.90} = 3.0777$
[Figure: t-distribution with 1 df; the area to the left of $t_{df=1,\,.90} = 3.078$ is .90.]

Computation of a confidence interval for the mean when the population variance is unknown:
• Replace the confidence coefficient from the N(0,1)
• with one from $t_{n-1}$
• and use an estimate of the standard error in place of the true standard error.
When $\sigma^2$ is known: $\bar{X} \pm (z_{1-\alpha/2})(\sigma/\sqrt{n})$
When $\sigma^2$ is UNknown: $\bar{X} \pm (t_{n-1;\,1-\alpha/2})(S/\sqrt{n})$

Example
A random sample of n = 20 recent cardiac bypass surgeries has a mean duration of $\bar{x} = 267$ minutes and sample variance $s^2 = 36{,}700$ min$^2$. Assuming the underlying distribution of surgery duration is normal with unknown variance, find the 90% CI estimate of the true mean duration of surgery, $\mu$.
1. $\bar{x} = 267$
2. $s = \sqrt{36700} = 191.6$
3. $se(\bar{x}) = s/\sqrt{n} = 191.6/\sqrt{20} = 42.8$
4. Find the confidence coefficient from the t-distribution:
   a. n = 20, so df = 19
   b. For $(1-\alpha) = .90$: $\alpha = .10$, $\alpha/2 = .05$, $1-\alpha/2 = .95$
   c. $t_{df;\,1-\alpha/2} = t_{19;\,.95} = 1.729$ (look up the value of the 95th percentile)
[Figure: $t_{19}$ density with central area 90% between -1.73 and $t_{19;\,.95} = 1.729$, and area .05 in each tail.]
5. 90% CI = Pt. Est. ± (Conf. Coeff)(Std Error Est.)
$$= \bar{x} \pm (t_{19;\,.95})(s/\sqrt{n}) = 267 \pm (1.729)(42.8) = (193.0, 341.0)$$
A 90% confidence interval for the true mean duration of surgery is (193.0, 341.0) minutes.

Estimation Highlights / Main Points
1. Estimation provides “guesses” of unknown population parameter values using information from a sample.
2.
While there may be many criteria for the selection of a “good” estimate, we'll use two criteria:
   1) Unbiasedness (in the long run, correct)
   2) Minimum variance (in the short run, the smallest error possible)
3. We can calculate both point and interval estimates.
4. Confidence interval estimates have the advantage of providing a sense of the precision in our data.
   • Wide intervals: poor precision
   • Narrow intervals: high precision
5. The width of the interval is a function of the confidence coefficient (percentile of a distribution) and the standard error.
   • Greater confidence: wider intervals
   • Larger samples: narrower intervals (as n gets large, $\sigma/\sqrt{n}$ or $s/\sqrt{n}$ gets smaller)

Computing a Confidence Interval for $\mu$

Case $\sigma^2$ known:
   Point estimate: $\bar{x}$
   What to use for variance: $\sigma$
   SE of $\bar{x}$: $\sigma/\sqrt{n}$
   Standardization: $z = \dfrac{\bar{x} - \mu}{\sigma/\sqrt{n}}$
   Distribution of standard score: Normal(0,1)
   Confidence interval: $\bar{x} \pm z\,[\sigma/\sqrt{n}]$
   Assumptions: random sample from a Normal distribution, or n large

Case $\sigma^2$ NOT known:
   Point estimate: $\bar{x}$
   What to use for variance: $s$
   SE of $\bar{x}$: $s/\sqrt{n}$
   Standardization: $t = \dfrac{\bar{x} - \mu}{s/\sqrt{n}}$
   Distribution of standard score: Student's t, with df = n-1
   Confidence interval: $\bar{x} \pm t\,[s/\sqrt{n}]$
   Assumptions: random sample from a Normal distribution
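The case with $\sigma^2$ not known can also be sketched in Python. The standard library has no built-in Student's t percentile, so the sketch below computes one numerically as a stand-in for the printed table or Minitab: it integrates the t density with Simpson's rule and inverts the result by bisection. The function names (`t_pdf`, `t_cdf`, `t_percentile`, `t_interval`) and the numerical settings are assumptions made for this illustration, not a standard API.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def t_cdf(t, df, steps=2000):
    """P(T <= t): Simpson's rule on [0, |t|], then symmetry about zero."""
    if t == 0:
        return 0.5
    a = abs(t)
    h = a / steps                      # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3               # area between 0 and |t|
    return 0.5 + half if t > 0 else 0.5 - half


def t_percentile(p, df):
    """Inverse cumulative probability of t (0 < p < 1), found by bisection."""
    if p < 0.5:
        return -t_percentile(1 - p, df)   # symmetry about zero
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < p:              # widen the bracket until it contains p
        hi *= 2
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_interval(xbar, s2, n, level):
    """Confidence interval for a mean using the sample variance s2."""
    t = t_percentile(1 - (1 - level) / 2, n - 1)
    se = math.sqrt(s2 / n)                # estimated standard error
    return xbar - t * se, xbar + t * se


# Surgery example: n = 20, sample mean 267, sample variance 36,700, 90% level.
lo, hi = t_interval(xbar=267, s2=36700, n=20, level=0.90)
print(round(t_percentile(0.95, 19), 3), round(lo, 1), round(hi, 1))
```

The computed limits can differ from a hand calculation in the last digit because the code does not round the standard deviation, standard error, or percentile along the way. In practice a statistics package would supply the t percentile directly; the numerical routine here only keeps the example self-contained.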