Chapter 7
Inferences Based on a Single Sample

Parameters and Statistics
• A parameter is a numeric characteristic of a population or distribution, usually symbolized by a Greek letter, such as μ, the population mean.
• Inferential statistics uses sample information to estimate parameters.
• A statistic is a number calculated from data.
• There are usually statistics that do the same job for samples that the parameters do for populations, such as x̄, the sample mean.

Using Samples for Estimation
[Diagram: the sample supplies x̄ (a known statistic), which is used to estimate μ (an unknown parameter) of the population.]

The Idea of Estimation
• We want to find a way to estimate the population parameters.
• We only have information from a sample, available in the form of statistics.
• The sample mean, x̄, is an estimator of the population mean, μ.
• This is called a “point estimate” because it is one point, or a single value.

Interval Estimation
• There is variation in x̄, since it is a random variable calculated from data.
• A point estimate doesn’t reveal anything about how much the estimate varies.
• An interval estimate gives a range of values that is likely to contain the parameter.
• Intervals are often reported in polls, such as “56% ±4% favor candidate A.” This suggests we are not sure it is exactly 56%, but we are quite sure that it is between 52% and 60%.
• 56% is the point estimate, whereas (52%, 60%) is the interval estimate.

The Confidence Interval
• A confidence interval is a special interval estimate involving a percent, called the confidence level.
• The confidence level tells how often, if samples were repeatedly taken, the interval estimate would surround the true parameter.
• We can use this notation: (L,U) or (LCL,UCL).
• L and U stand for Lower and Upper endpoints. The longer versions, LCL and UCL, stand for “Lower Confidence Limit” and “Upper Confidence Limit.”
• This interval is built around the point estimate.
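The repeated-sampling meaning of the confidence level can be checked with a small simulation. The sketch below is only an illustration: it assumes a normal population with known σ and uses the standard interval x̄ ± z·σ/√n, and the function name `coverage`, the population values (μ = 50, σ = 10, n = 25), the number of trials, and the seed are all choices made here, not taken from the slides.

```python
import random
from statistics import NormalDist, fmean


def coverage(mu=50.0, sigma=10.0, n=25, level=0.95, trials=2000, seed=1):
    """Fraction of simulated intervals (sigma known) that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # z-value with upper-tail probability alpha/2
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / n ** 0.5  # margin of error, same for every sample
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = fmean(sample)  # point estimate from this sample
        if xbar - margin <= mu <= xbar + margin:
            hits += 1  # this interval captured the true mean
    return hits / trials


rate = coverage()
print(f"Observed coverage: {rate:.3f}")
```

With a 95% level, the observed fraction should land close to 0.95, and lowering `level` should lower it accordingly. Any single interval either contains μ or it does not; the percentage describes the procedure over many samples.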
Theory of Confidence Intervals
• Alpha (α) represents the probability that when the sample is taken, the calculated CI will miss the parameter.
• The confidence level is given by (1-α)×100%, and is used to name the interval, so for example, we may have “a 90% CI for μ.”
• After sampling, we say that we are, for example, “90% confident that we have captured the true parameter.” (There is no probability at this point. Either we did or we didn’t, but we don’t know.)

How to Calculate CI’s
• Many CI’s have the following basic structure: P ± TS
– where P is the parameter estimate,
– T is a “table” value equal to the number of standard deviations needed for the confidence level,
– and S is the standard deviation of the estimate.
• The quantity TS is also called the “Error Bound” (B) or “Margin of Error.”
• The CI should be written as (L,U) where L = P-TS and U = P+TS.
• Don’t forget to convert your P ± TS expression to confidence interval form, including parentheses!

A Confidence Interval for μ
• If σ is known, and
• the population is normally distributed, or n>30 (so that we can say x̄ is approximately normally distributed), then
  $\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \sigma/\sqrt{n}$,
  gives the endpoints for a (1-α)100% CI for μ.
• Note how this corresponds to the P ± TS formula given earlier.

Distribution Details
• What is $z_{\alpha/2}$?
– α is the significance level, P(CI will miss).
– The subscript on z refers to the upper tail probability, that is, $P(Z > z_{\alpha/2}) = \alpha/2$.
– To find this value in the table, look up the z-value for a probability of .5-α/2.
• Examples

Example: Estimation of µ (σ Known)
A random sample of 25 items resulted in a sample mean of 50. Construct a 95% confidence interval estimate for μ if σ = 10.
  $\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} = 50 \pm 1.96 \cdot \frac{10}{\sqrt{25}} = 50 \pm 3.92$
  The 95% CI is (46.08, 53.92).

Confidence Interval Estimates
[Chart: confidence intervals branch into Mean (σ known, σ unknown), Proportion, and Variance.]

Estimation of μ (σ unknown)
• We now turn to the situation where σ is unknown but the sample size is large or the sampled population is normal.
• Since σ is unknown, we use s in its place.
• However, without knowing σ, we are not able to make use of the z table in building a confidence interval.
• Instead, we will use a distribution called t (Student’s t).
• The t distribution is symmetric and bell-shaped like the standard normal, and also has μ=0, but σ>1, so the shape is flatter in the middle and thicker in the tails.

Student’s t-Distributions
[Figure: the normal distribution and Student’s t curves for df = 15 and df = 5, all centered at 0 on the t axis.]
Degrees of Freedom, df: A parameter that identifies each different distribution of Student’s t-distribution. For the methods presented in this chapter, the value of df will be the sample size minus 1, df = n - 1.

Using t
• As the previous graph shows, the t distribution has another parameter, called degrees of freedom (df). So this is actually a family of distributions, with different df values.
• The higher the df, the closer the t distribution comes to the standard normal.
• For our purposes, df=n-1. It is actually related to the denominator in the formula for s².
• There is a t-table in the back of the book. It is different from the z-table, so we have to understand how it works.

The t table
• Refer to the table. First you will notice the left-hand column is for df.
• When df ≥100, the z-table can be used, because the values will be very close.
• This table gives tail probabilities, similar to z(α). However, only a selection of probabilities is given, across the top of the table.
• The interior of the table gives the t-values, so it is arranged almost opposite of the z-table.
• The notation used for t-values is t(df,α).
• Just like z(α), α refers to the upper tail probability.

t-Distribution Showing t(df, α)
[Figure: t curve centered at 0, with area α in the upper tail to the right of t(df, α).]

Example: Find the value of t(12, 0.025).
[Figure: t curve with df = 12; an area of 0.025 lies in each tail, below -t(12, 0.025) = -2.18 and above t(12, 0.025) = 2.18.]
Portion of t-table: the row df = 12 and the column “Amount of α in one tail” = 0.025 give t = 2.18.

Confidence Intervals
• When we build our confidence interval, α refers to the probability in both tails.
• This is not the same α used in looking up the distribution!
So what we have to look up is actually α/2, because that’s the upper tail probability.
• And so we come to the formula for a (1-α)100% CI for μ when σ is unknown:
  $\bar{x} \pm t(df, \alpha/2)\, s_{\bar{x}}$, where $s_{\bar{x}} = s/\sqrt{n}$

Example: A study is conducted to learn how long it takes the typical taxpayer to complete his or her federal income tax return. A random sample of 17 income tax filers showed a mean time (in hours) of 7.8 and a standard deviation of 2.3. Find a 95% confidence interval for the true mean time required to complete a federal income tax return. Assume the time to complete the return is normally distributed.

Solution:
1. Parameter of Interest: the mean time required to complete a federal income tax return.
2. Confidence Interval Criteria:
   a. Assumptions: Sampled population assumed normal, σ unknown.
   b. Distribution table value: t will be used.
   c. Confidence level: 1 - α = 0.95
3. The Sample Evidence: n = 17, x̄ = 7.8, and s = 2.3
4. Calculations:
   $t(df, \alpha/2) = t(16, 0.025) = 2.12$
   $s_{\bar{x}} = \frac{s}{\sqrt{n}} = \frac{2.3}{\sqrt{17}} = 0.5578$
   $\bar{x} \pm t(df, \alpha/2)\, s_{\bar{x}} = 7.8 \pm (2.12)(0.5578) = 7.8 \pm 1.18$
   (7.8 - 1.18, 7.8 + 1.18) = (6.62, 8.98)
5. (6.62, 8.98) is the 95% confidence interval for µ.

Confidence Interval for a Proportion
• Assumptions
– Population follows the binomial distribution
– The normal approximation can be used if
  • $n\hat{p} \pm 3\sqrt{n\hat{p}(1 - \hat{p})}$ does not include 0 or n (equivalently, $\hat{p} \pm 3\sqrt{\hat{p}(1 - \hat{p})/n}$ does not include 0 or 1)
  • Or (older guideline) $n\hat{p} \ge 5$ and $n\hat{q} \ge 5$

Confidence Interval Estimate
  $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$

Example
A random sample of 400 graduates showed 32 went to grad school. Set up a 95% confidence interval estimate for p.
  $\hat{p} \pm z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}} = .08 \pm 1.96\sqrt{\frac{.08(1 - .08)}{400}}$
  The 95% CI is (0.053, 0.107).

New Method
• A new method (Agresti & Coull, 1998) can be used to avoid the problems with extreme p’s. There is no need to check the np or nq values with this method.
• Define $p^* = \frac{x + 2}{n + 4}$
• Then a (1-α)100% CI for p is given by
  $p^* \pm z_{\alpha/2}\sqrt{\frac{p^*(1 - p^*)}{n + 4}}$

Example
• In the 2004 presidential election, Ralph Nader had about 0.34% of the vote.
Suppose an exit poll was taken to estimate Nader’s share of the vote, with a sample size of 200, and 2 people indicated they voted for Nader.
• Note that with the traditional method, $n\hat{p} = 2 < 5$, so the formula is not valid.
• Use the p* method to construct a 95% CI for p.
  $p^* = \frac{x + 2}{n + 4} = \frac{4}{204} = .0196$
  $\sqrt{\frac{p^*(1 - p^*)}{n + 4}} = \sqrt{\frac{.0196(.9804)}{204}} = .0097$
  $p^* \pm z_{\alpha/2}\sqrt{\frac{p^*(1 - p^*)}{n + 4}} = .0196 \pm 1.96(.0097)$
  A 95% CI for Nader's vote is (.0006, .0386).

Choosing CI Formulas
• CI for µ
– σ known
  Small sample, population normal: use z with σ
  Small sample, population not normal: use nonparametric
  Large sample: use z with σ
– σ unknown
  Small sample, population normal: use t with s
  Small sample, population not normal: use nonparametric
  Large sample: use t or z with s
• CI for p
– np>5 and nq>5: use traditional
– Otherwise: use p* method

Sample Size Calculation
• We may wish to decide upon a sample size so that we can get a confidence interval with a pre-determined width.
• This is common in polls, where the margin of error is usually decided in advance.
• All CI’s we have seen so far have the form P±B, where B is the margin of error.
• We want to fix B in advance.

Sample Size for Estimating µ, σ Known
• Suppose X is a random variable with σ=10 and we want a 90% CI to have a Bound, or Margin of Error, of 3.
• Use the formula $B = z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}$.
• Fill in the numbers: $3 = 1.645 \cdot \frac{10}{\sqrt{n}}$
• Solve: $\sqrt{n} = 1.645 \cdot \frac{10}{3} = 5.483$, so $n = 30.07$
• This is the minimum sample size, but we need a whole number, so round up to n=31.

Sample Size for Estimating µ, σ Unknown
• If σ is unknown, the confidence interval will be calculated using the t distribution, unless n is very large.
• But the degrees of freedom depend on n, which we don’t know.
• The calculation also depends on s, which we don’t know until after sampling.
• We must have an initial guess for s, and then use the normal distribution to approximate the t distribution, since it does not require knowing n.
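The sample-size calculation for a mean can be sketched in a few lines of Python. This is a minimal illustration under the same normal-approximation assumption as the slides: it solves B = z·σ/√n for n and rounds up. The function name `sample_size_for_mean` and its parameter names are made up for this example, and the z-value comes from the standard library rather than a printed table, so it carries a few more decimal places than 1.645.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, bound, level):
    """Smallest whole n for which z * sigma / sqrt(n) <= bound."""
    alpha = 1 - level
    # z-value with upper-tail probability alpha/2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    # Solve bound = z * sigma / sqrt(n) for n, then round up
    return math.ceil((z * sigma / bound) ** 2)


# sigma = 10, margin of error 3, 90% confidence
n = sample_size_for_mean(sigma=10, bound=3, level=0.90)
print(n)  # 31
```

When σ is unknown, the same function can be called with an initial guess for s (for example, from a pilot study) in place of `sigma`, which is the approximation described in the last bullet above.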
Example (σ unknown)
• A manufacturer needs to be able to estimate the width of a new part to within 2mm with 95% confidence. There is not enough history to know what σ would be, so a pilot study is run by measuring 6 parts, and finding s=3.4mm.
  $B = z_{\alpha/2}\,\frac{s}{\sqrt{n}}$
  $2 = 1.96 \cdot \frac{3.4}{\sqrt{n}}$
  $\sqrt{n} = 1.96 \cdot \frac{3.4}{2}$, so $n = 11.1$
• Rounding up to the next whole number gives n=12.

Sample Size for Estimating p, a Population Proportion
• With a population proportion, we also have a problem in getting the standard deviation part of the Margin of Error, since it depends on p, the thing we are trying to estimate.
• There are two possibilities:
– 1) We may have a preliminary guess about p that we can use, or
– 2) We can use p=.5 because that maximizes the standard deviation.
• The sample size will be calculated from the desired margin of error, or error bound.

Example (proportion)
• A pollster wants to do a simple random sample to estimate the proportion of the population favoring an increase in property taxes for school funding. He wants a margin of error of 3%, with 90% confidence. The general belief is that it will be a close election, so an initial value of p=.5 is reasonable.
  $B = z_{\alpha/2}\sqrt{\frac{\hat{p}(1 - \hat{p})}{n}}$
  $.03 = 1.645\sqrt{\frac{.25}{n}}$
  $\sqrt{n} = 1.645 \cdot \frac{.5}{.03}$, so $n = 751.7$
• Rounding up to the next whole number gives n=752.

Misc. Notes
• The CI for µ formula using z is also called the “Large Sample” CI. It is valid when σ is known, for any sample size, but it also serves as an approximation of the t formula (using s) when n is large. How large? Many books say n≥30. I recommend making use of the t table up to n=100 since that is how far it goes. Statistical computer programs will always calculate t values, regardless of how large n is, for the σ unknown case.

Misc. Notes
• The CI for µ formula using t is also called the “Small Sample” CI, but only because the other one is called “Large Sample.” It is valid for any sample size when σ is unknown and the population is normal.
• We do not cover methods for small samples that do not come from a normal population in this course (non-parametric methods).

Misc. Notes
• The t table is limited because it does not have a very good selection of probabilities. It also “jumps” in the df column. It is possible to use the “closest” value or interpolate when you can’t find what you need, but a better option is to use the Excel functions, TDIST and TINV.
• However, you have to be VERY careful about what Excel is giving you.

Excel’s TDIST Function
• TDIST takes a t value and returns the tail probability. You can choose one or two tails.

Excel’s TINV Function
• The TINV function takes a two-tailed probability and returns a t-value (just what we need now).

Excel Function Comparison
• The NORMSINV function, by contrast, takes a left-tailed probability and returns a z-value. This means you have to enter α/2 and take the negative, or else use 1-α/2 as the argument.

  Formula               Result
  =NORMSINV(0.05)       -1.644853
  =-NORMSINV(0.05)       1.644853
  =-NORMSINV(0.025)      1.959963
  =NORMSINV(0.975)       1.959963
  =TINV(0.05,1000)       1.962339
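The same left-tail convention appears outside Excel. As a rough sketch, Python's standard library offers `statistics.NormalDist().inv_cdf`, which, like NORMSINV, takes a left-tailed probability; the helper names `norm_s_inv` and `z_upper` below are invented for this illustration. The standard library has no t quantile function, so the TINV row is not reproduced; with SciPy installed, `scipy.stats.t.ppf(1 - alpha/2, df)` should play the role of `TINV(alpha, df)`.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def norm_s_inv(p):
    """z-value with left-tail probability p (same convention as NORMSINV)."""
    return std_normal.inv_cdf(p)


def z_upper(tail):
    """z-value with upper-tail probability `tail`, as needed for a CI."""
    return std_normal.inv_cdf(1 - tail)


alpha = 0.05
print(f"{norm_s_inv(alpha):.4f}")       # left tail of 0.05: negative value
print(f"{-norm_s_inv(alpha / 2):.4f}")  # enter alpha/2, take the negative
print(f"{z_upper(alpha / 2):.4f}")      # or use 1 - alpha/2 as the argument
```

The last two lines print the same number, which is the z-value used in a 95% interval. Results may differ from the table above in the last displayed digit because of rounding.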