Statistical Inference
Making decisions regarding the population based on a sample

Decision Types
• Estimation – Deciding on the value of an unknown parameter
• Hypothesis Testing – Deciding whether a statement regarding an unknown parameter is true or false
• All decisions will be based on the values of statistics

Estimation
• Definitions
– An estimator of an unknown parameter is a sample statistic used for this purpose
– An estimate is the value of the estimator after the data is collected
• The performance of an estimator is assessed by determining its sampling distribution and measuring its closeness to the parameter being estimated

Examples of Estimators

The Sample Proportion
Let p = population proportion of interest or binomial probability of success.
Let
$$\hat{p} = \frac{X}{n} = \frac{\text{no. of successes}}{\text{no. of binomial trials}}$$
= sample proportion or proportion of successes.
Then the sampling distribution of $\hat{p}$ is (approximately, for large n) a normal distribution with
$$\text{mean } \mu_{\hat{p}} = p \quad \text{and standard deviation } \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$$
[Figure: sampling distribution of $\hat{p}$, centred at p]

The Sample Mean
Let $x_1, x_2, x_3, \ldots, x_n$ denote a sample of size n from a normal distribution with mean $\mu$ and standard deviation $\sigma$.
Let
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \text{sample mean}$$
Then the sampling distribution of $\bar{x}$ is a normal distribution with
$$\text{mean } \mu_{\bar{x}} = \mu \quad \text{and standard deviation } \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
[Figure: sampling distribution of $\bar{x}$ for n = 5, 10, 15 and 20, compared with the population distribution]

Confidence Intervals

Estimation by Confidence Intervals
• Definition – A 100P% confidence interval for an unknown parameter is a pair of sample statistics ($t_1$ and $t_2$) having the following properties:
1. $P[t_1 < t_2] = 1$. That is, $t_1$ is always smaller than $t_2$.
2. P[the unknown parameter lies between $t_1$ and $t_2$] = P.
• The statistics $t_1$ and $t_2$ are random variables
• Property 2 states that the probability that the unknown parameter is bounded by the two statistics $t_1$ and $t_2$ is P.
Critical values for a distribution
• The α upper critical value for any distribution is the point $x_\alpha$ underneath the distribution such that $P[X > x_\alpha] = \alpha$

Critical values for the standard Normal distribution
$$P[Z > z_\alpha] = \alpha$$

Confidence Intervals for a proportion p
Let
$$t_1 = \hat{p} - z_{\alpha/2}\,\sigma_{\hat{p}} = \hat{p} - z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \approx \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
and
$$t_2 = \hat{p} + z_{\alpha/2}\,\sigma_{\hat{p}} = \hat{p} + z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \approx \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Then $t_1$ to $t_2$ is a (1 – α)100% = P100% confidence interval for p

Logic:
$$z = \frac{\hat{p} - p}{\sigma_{\hat{p}}} \text{ has a Standard Normal distribution}$$
Then
$$P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$
$$P\left[-z_{\alpha/2} \le \frac{\hat{p} - p}{\sigma_{\hat{p}}} \le z_{\alpha/2}\right] = 1 - \alpha$$
$$P[-z_{\alpha/2}\,\sigma_{\hat{p}} \le \hat{p} - p \le z_{\alpha/2}\,\sigma_{\hat{p}}] = 1 - \alpha$$
$$P[-z_{\alpha/2}\,\sigma_{\hat{p}} \le p - \hat{p} \le z_{\alpha/2}\,\sigma_{\hat{p}}] = 1 - \alpha$$
Hence
$$P[\hat{p} - z_{\alpha/2}\,\sigma_{\hat{p}} \le p \le \hat{p} + z_{\alpha/2}\,\sigma_{\hat{p}}] = P[t_1 \le p \le t_2] = 1 - \alpha$$
Thus $t_1$ to $t_2$ is a (1 – α)100% = P100% confidence interval for p

Example
• Suppose we are interested in determining the success rate of a new drug for reducing Blood Pressure
• The new drug is given to n = 70 patients with abnormally high Blood Pressure
• Of these patients, X = 63 were able to reduce the abnormally high level of Blood Pressure
• The proportion of patients able to reduce the abnormally high level of Blood Pressure was
$$\hat{p} = \frac{X}{n} = \frac{63}{70} = 0.900$$
If P = 1 – α = 0.95 then α/2 = 0.025 and $z_{\alpha/2}$ = 1.960
Then
$$t_1 = \hat{p} - z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.90 - (1.960)\sqrt{\frac{(0.90)(0.10)}{70}} = 0.90 - 0.0703 = 0.8297$$
and
$$t_2 = \hat{p} + z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = 0.90 + (1.960)\sqrt{\frac{(0.90)(0.10)}{70}} = 0.90 + 0.0703 = 0.9703$$
Thus a 95% confidence interval for p is 0.8297 to 0.9703

Confidence Interval for a Proportion
100P% Confidence Interval for the population proportion:
$$\hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}}, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$z_{\alpha/2}$ = upper α/2 critical point of the standard normal distribution
Interpretation: For about 100P% of all randomly selected samples from the population, the confidence interval computed in this manner captures the population proportion.

Error Bound
For a (1 – α)100% confidence level, the approximate margin of error in a sample proportion is
$$\text{Error Bound} = z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$

Factors that Determine the Error Bound
1. The sample size, n. When sample size increases, margin of error decreases.
2. The sample proportion, $\hat{p}$. If the proportion is close to either 1 or 0, most individuals have the same trait or opinion, so there is little natural variability and the margin of error is smaller than if the proportion is near 0.5.
3. The "multiplier" $z_{\alpha/2}$. This is connected to the (1 – α)100% level of confidence of the Error Bound. The value of $z_{\alpha/2}$ for a 95% level of confidence is 1.96. This value is changed to change the level of confidence.

Determination of Sample Size
In almost all research situations the researcher is interested in the question: How large should the sample be?
Answer: Depends on:
• How accurate you want the answer.
Accuracy is specified by:
• Specifying the magnitude of the error bound
• Level of confidence
Error Bound:
$$B = z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \approx z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
• If we have specified the level of confidence then the value of $z_{\alpha/2}$ will be known.
• If we have specified the magnitude of B, it will also be known
Solving for n we get:
$$n = \frac{z_{\alpha/2}^2\, p(1-p)}{B^2} \approx \frac{z_{\alpha/2}^2\, p^*(1-p^*)}{B^2}$$
Summarizing: The sample size that will estimate p with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\, p^*(1-p^*)}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the upper α/2 critical value for the standard normal distribution
• p* is some preliminary estimate of p.
If you do not have a preliminary estimate of p, use p* = 0.50

Reason
For p* = 0.50, $n = z_{\alpha/2}^2\, p^*(1-p^*)/B^2$ will take on the largest value, because $p^*(1-p^*)$ is largest at p* = 0.50.
[Figure: required sample size n plotted against p*, peaking at p* = 0.50]
Thus using p* = 0.50, n may be larger than required if p is not 0.50, but will give the desired accuracy or better for all values of p.

Example
• Suppose that I want to conduct a survey and want to estimate p = proportion of voters who favour a downtown location for a casino
• I know that the approximate value of p is p* = 0.50. This is also a good choice for p if one has no preliminary estimate of its value.
• I want the survey to estimate p with an error bound B = 0.01 (1 percentage point)
• I want the level of confidence to be 95% (i.e. α = 0.05 and $z_{\alpha/2} = z_{0.025}$ = 1.960)
Then
$$n = \frac{(1.960)^2 (0.50)(0.50)}{(0.01)^2} = 9604$$

Confidence Intervals for the mean of a Normal Population, $\mu$
Let
$$t_1 = \bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
and
$$t_2 = \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}} = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
Then $t_1$ to $t_2$ is a (1 – α)100% = P100% confidence interval for $\mu$

Logic:
$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \text{ has a Standard Normal distribution}$$
Then
$$P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$
$$P\left[-z_{\alpha/2} \le \frac{\bar{x} - \mu}{\sigma_{\bar{x}}} \le z_{\alpha/2}\right] = 1 - \alpha$$
Hence
$$P[\bar{x} - z_{\alpha/2}\,\sigma_{\bar{x}} \le \mu \le \bar{x} + z_{\alpha/2}\,\sigma_{\bar{x}}] = P[t_1 \le \mu \le t_2] = 1 - \alpha$$
Thus $t_1$ to $t_2$ is a (1 – α)100% = P100% confidence interval for $\mu$

Example
• Suppose we are interested in the average Bone Mass Density (BMD) for women aged 70-75
• A sample of n = 100 women aged 70-75 is selected and BMD is measured for each individual in the sample.
• The average BMD for these individuals is: $\bar{x} = 25.63$
• The standard deviation (s) of BMD for these individuals is: s = 7.82
If P = 1 – α = 0.95 then α/2 = 0.025 and $z_{\alpha/2}$ = 1.960
Then
$$t_1 = \bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \approx \bar{x} - z_{\alpha/2}\frac{s}{\sqrt{n}} = 25.63 - 1.960\,\frac{7.82}{\sqrt{100}} = 25.63 - 1.53 = 24.10$$
and
$$t_2 = \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \approx \bar{x} + z_{\alpha/2}\frac{s}{\sqrt{n}} = 25.63 + 1.960\,\frac{7.82}{\sqrt{100}} = 25.63 + 1.53 = 27.16$$
Thus a 95% confidence interval for $\mu$ is 24.10 to 27.16

Determination of Sample Size
Again a question to be asked: How large should the sample be?
Answer: Depends on:
• How accurate you want the answer.
Accuracy is specified by:
• Specifying the magnitude of the error bound
• Level of confidence
Error Bound:
$$B = z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$
• If we have specified the level of confidence then the value of $z_{\alpha/2}$ will be known.
• If we have specified the magnitude of B, it will also be known
Solving for n we get:
$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{B^2} \approx \frac{z_{\alpha/2}^2\,(s^*)^2}{B^2}$$
Summarizing: The sample size that will estimate $\mu$ with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{B^2} \approx \frac{z_{\alpha/2}^2\,(s^*)^2}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the upper α/2 critical value for the standard normal distribution
• s* is some preliminary estimate of s.
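The confidence interval for the mean in the BMD example can be checked numerically. The following is a minimal Python sketch, not part of the original slides; the function names `z_critical` and `mean_confidence_interval` are illustrative choices, and the critical value is taken from the standard library's `statistics.NormalDist` instead of a printed table.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Upper-alpha critical value of the standard normal: P[Z > z] = alpha."""
    return NormalDist().inv_cdf(1 - alpha)


def mean_confidence_interval(xbar, s, n, confidence=0.95):
    """Return (t1, t2) = xbar -/+ z_{alpha/2} * s / sqrt(n).

    Assumes, as the slides do, that s is an adequate stand-in for sigma.
    """
    alpha = 1 - confidence
    bound = z_critical(alpha / 2) * s / sqrt(n)
    return xbar - bound, xbar + bound


# BMD example: xbar = 25.63, s = 7.82, n = 100, 95% confidence
t1, t2 = mean_confidence_interval(25.63, 7.82, 100)
print(f"95% confidence interval: {t1:.2f} to {t2:.2f}")
```

With the inputs from the example this prints an interval of 24.10 to 27.16, matching the hand calculation.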
Notes:
$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{B^2} \approx \frac{z_{\alpha/2}^2\,(s^*)^2}{B^2}$$
• n increases as B, the desired Error Bound, decreases
– Larger sample size required for higher level of accuracy
• n increases as the level of confidence, (1 – α), increases
– $z_{\alpha/2}$ increases as α/2 becomes closer to zero.
– Larger sample size required for higher level of confidence
• n increases as the standard deviation, $\sigma$, of the population increases.
– If the population is more variable then a larger sample size is required

Summary:
The sample size n depends on:
• Desired level of accuracy
• Desired level of confidence
• Variability of the population

Example
• Suppose that one is interested in estimating the average number of grams of fat ($\mu$) in one kilogram of lean beef hamburger.
This will be estimated by:
• randomly selecting one kilogram samples, then
• measuring the fat content for each sample.
• Preliminary estimates of $\mu$ and $\sigma$ indicate:
– that $\mu$ and $\sigma$ are approximately 220 and 40 respectively.
• I want the study to estimate $\mu$ with an error bound of 5 and
• a level of confidence of 95% (i.e. α = 0.05 and $z_{\alpha/2} = z_{0.025}$ = 1.960)
Solution
$$n = \frac{(1.960)^2 (40)^2}{(5)^2} = 245.9 \approx 246$$
Hence n = 246 one kilogram samples are required to estimate $\mu$ within B = 5 g with a 95% level of confidence.

Confidence Intervals

Confidence Interval for a Proportion
$$\hat{p} \pm z_{\alpha/2}\,\sigma_{\hat{p}}, \qquad \sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}} \approx \sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$z_{\alpha/2}$ = upper α/2 critical point of the standard normal distribution
$$B = z_{\alpha/2}\,\sigma_{\hat{p}} = z_{\alpha/2}\sqrt{\frac{p(1-p)}{n}} \approx z_{\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \text{Error Bound}$$

Determination of Sample Size
The sample size that will estimate p with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\, p^*(1-p^*)}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the upper α/2 critical value for the standard normal distribution
• p* is some preliminary estimate of p.
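The proportion interval and the two sample-size formulas summarized above can be sketched in a few lines of Python. This is an illustrative sketch, not part of the original slides: the function names are assumptions, the critical value comes from `statistics.NormalDist` (about 1.95996, not the rounded 1.960), and the sample sizes are rounded up to the next whole observation.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Upper-alpha critical value of the standard normal: P[Z > z] = alpha."""
    return NormalDist().inv_cdf(1 - alpha)


def proportion_confidence_interval(x, n, confidence=0.95):
    """Return (t1, t2) = p_hat -/+ z_{alpha/2} * sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = x / n
    z = z_critical((1 - confidence) / 2)
    bound = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - bound, p_hat + bound


def sample_size_for_proportion(bound, confidence=0.95, p_star=0.50):
    """n = z^2 p*(1 - p*) / B^2, rounded up; p* = 0.50 is the worst case."""
    z = z_critical((1 - confidence) / 2)
    return ceil(z**2 * p_star * (1 - p_star) / bound**2)


def sample_size_for_mean(bound, s_star, confidence=0.95):
    """n = z^2 (s*)^2 / B^2, rounded up; s* is a preliminary estimate of s."""
    z = z_critical((1 - confidence) / 2)
    return ceil(z**2 * s_star**2 / bound**2)


# Drug example: X = 63 successes out of n = 70 patients
lo, hi = proportion_confidence_interval(63, 70)
print(f"95% CI for p: {lo:.4f} to {hi:.4f}")

# Casino survey: B = 0.01, p* = 0.50
n_survey = sample_size_for_proportion(0.01)
print("survey sample size:", n_survey)

# Hamburger example: B = 5, s* = 40
n_beef = sample_size_for_mean(5, 40)
print("hamburger sample size:", n_beef)
```

Rounding up is a conservative choice: a fractional n such as 245.9 cannot be collected, and rounding down would give an error bound slightly larger than B.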
Confidence Intervals for the mean of a Normal Population, $\mu$
$$\bar{x} \pm z_{\alpha/2}\,\sigma_{\bar{x}} \quad \text{or} \quad \bar{x} \pm z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \quad \text{or} \quad \bar{x} \pm z_{\alpha/2}\frac{s}{\sqrt{n}}$$
$\bar{x}$ = sample mean
$z_{\alpha/2}$ = upper α/2 critical point of the standard normal distribution
s = sample standard deviation

Determination of Sample Size
The sample size that will estimate $\mu$ with an Error Bound B and level of confidence P = 1 – α is:
$$n = \frac{z_{\alpha/2}^2\,\sigma^2}{B^2} \approx \frac{z_{\alpha/2}^2\,(s^*)^2}{B^2}$$
where:
• B is the desired Error Bound
• $z_{\alpha/2}$ is the upper α/2 critical value for the standard normal distribution
• s* is some preliminary estimate of s.

Hypothesis Testing
An important area of statistical inference

Definition
Hypothesis (H) – Statement about the parameters of the population
• In hypothesis testing there are two hypotheses of interest.
– The null hypothesis (H0)
– The alternative hypothesis (HA)

Either
– the null hypothesis (H0) is true or
– the alternative hypothesis (HA) is true.
But not both. We say that H0 and HA are mutually exclusive and exhaustive.

One has to make a decision
– either to accept the null hypothesis (equivalent to rejecting HA) or
– to reject the null hypothesis (equivalent to accepting HA)

There are two possible errors that can be made.
1. Rejecting the null hypothesis when it is true. (type I error)
2. Accepting the null hypothesis when it is false. (type II error)

An analogy – a jury trial
The two possible decisions are
– Declare the accused innocent.
– Declare the accused guilty.
The null hypothesis (H0) – the accused is innocent
The alternative hypothesis (HA) – the accused is guilty
The two possible errors that can be made:
– Declaring an innocent person guilty. (type I error)
– Declaring a guilty person innocent. (type II error)
Note: in this case one type of error may be considered more serious

Decision Table showing types of Error
              H0 is True          H0 is False
Accept H0     Correct Decision    Type II Error
Reject H0     Type I Error        Correct Decision

To define a statistical Test we
1. Choose a statistic (called the test statistic)
2. Divide the range of possible values for the test statistic into two parts
• The Acceptance Region
• The Critical Region

To perform a statistical Test we
1. Collect the data.
2. Compute the value of the test statistic.
3. Make the Decision:
• If the value of the test statistic is in the Acceptance Region we decide to accept H0.
• If the value of the test statistic is in the Critical Region we decide to reject H0.

Example
We are interested in determining if a coin is fair, i.e. H0: p = probability of tossing a head = ½.
To test this we will toss the coin n = 10 times.
The test statistic is x = the number of heads.
This statistic will have a binomial distribution with p = ½ and n = 10 if the null hypothesis is true.
[Figure: sampling distribution of x when H0 is true (binomial, n = 10, p = ½), x = 0 to 10]
Note: We would expect the test statistic x to be around 5 if H0: p = ½ is true.
Acceptance Region = {3, 4, 5, 6, 7}.
Critical Region = {0, 1, 2, 8, 9, 10}.
The reason for the choice of the Acceptance Region: it contains the values that we would expect for x if the null hypothesis is true.

Definitions: For any statistical testing procedure define
1. α = P[Rejecting the null hypothesis when it is true] = P[type I error]
2. β = P[accepting the null hypothesis when it is false] = P[type II error]

In the last example
1. α = P[type I error] = p(0) + p(1) + p(2) + p(8) + p(9) + p(10) = 0.109, where p(x) are binomial probabilities with p = ½ and n = 10.
2. β = P[type II error] = p(3) + p(4) + p(5) + p(6) + p(7), where p(x) are binomial probabilities with p (not equal to ½) and n = 10. Note: these will depend on the value of p.

Table: Probability of a Type II error, β vs. p
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.070  0.322  0.616  0.820  0.820  0.616  0.322  0.070
Note: the magnitude of β increases as p gets closer to ½.

Comments:
1. You can control α = P[type I error] and β = P[type II error] by widening or narrowing the acceptance region.
2. Widening the acceptance region decreases α = P[type I error] but increases β = P[type II error].
3. Narrowing the acceptance region increases α = P[type I error] but decreases β = P[type II error].

Example – Widening the Acceptance Region
Suppose the Acceptance Region includes, in addition to its previous values, 2 and 8. That is, the Acceptance Region is {2,3,4,5,6,7,8}. Then
1. α = P[type I error] = p(0) + p(1) + p(9) + p(10) = 0.021, where again p(x) are binomial probabilities with p = ½ and n = 10.
2. β = P[type II error] = p(2) + p(3) + p(4) + p(5) + p(6) + p(7) + p(8).

Table: Probability of a Type II error, β vs. p
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.264  0.624  0.851  0.952  0.952  0.851  0.624  0.264
Note: Compare these values with those for the previous definition of the Acceptance Region. They have increased.

Example – Narrowing the Acceptance Region
Suppose the original Acceptance Region excludes the values 3 and 7. That is, the Acceptance Region is {4,5,6}. Then
1. α = P[type I error] = p(0) + p(1) + p(2) + p(3) + p(7) + p(8) + p(9) + p(10) = 0.344.
2. β = P[type II error] = p(4) + p(5) + p(6).

Table: Probability of a Type II error, β vs. p
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.013  0.120  0.340  0.563  0.563  0.340  0.120  0.013
Note: Compare these values with those for the original definition of the Acceptance Region. They have decreased.

Comparison of the three Acceptance Regions

Acceptance Region {2,3,4,5,6,7,8}: α = 0.021
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.264  0.624  0.851  0.952  0.952  0.851  0.624  0.264

Acceptance Region {3,4,5,6,7}: α = 0.109
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.070  0.322  0.616  0.820  0.820  0.616  0.322  0.070

Acceptance Region {4,5,6}: α = 0.344
p    0.1    0.2    0.3    0.4    0.6    0.7    0.8    0.9
β    0.013  0.120  0.340  0.563  0.563  0.340  0.120  0.013

The Approach in Statistical Testing is:
• Set up the Acceptance Region so that α is close to some predetermined value (the usual values are 0.05 or 0.01)
• The predetermined value of α (0.05 or 0.01) is called the significance level of the test.
• The significance level of the test is α = P[test makes a type I error]

The z-test for Proportions
Testing the probability of success in a binomial experiment

Situation
• A success-failure experiment has been repeated n times
• The probability of success p is unknown. We want to test
– H0: p = p0 (some specified value of p)
Against
– HA: p ≠ p0

The Data
• The success-failure experiment has been repeated n times
• The number of successes x is observed.
$$\hat{p} = \frac{x}{n} = \text{the proportion of successes}$$
• Obviously if this proportion is close to p0 the Null Hypothesis should be accepted, otherwise the Null Hypothesis should be rejected.

The Test Statistic
• To decide to accept or reject the Null Hypothesis (H0) we will use the test statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$
• If H0 is true we should expect the test statistic z to be close to zero.
• If H0 is true we should expect the test statistic z to have a standard normal distribution.
• If HA is true we should expect the test statistic z to be different from zero.
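Returning to the coin-tossing example, the values of α and β tabulated for the three acceptance regions can be reproduced directly from binomial probabilities. The sketch below is illustrative only and is not part of the original slides; the helper names (`binomial_pmf`, `type_i_error`, `type_ii_error`) are assumptions made for this example.

```python
from math import comb


def binomial_pmf(x, n, p):
    """p(x) = C(n, x) p^x (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def acceptance_probability(region, n, p):
    """Probability that the number of heads x falls in the acceptance region."""
    return sum(binomial_pmf(x, n, p) for x in region)


def type_i_error(region, n, p0):
    """alpha = P[x outside the acceptance region] when p = p0 (H0 true)."""
    return 1 - acceptance_probability(region, n, p0)


def type_ii_error(region, n, p):
    """beta = P[x inside the acceptance region] when the true p differs from p0."""
    return acceptance_probability(region, n, p)


n = 10
p_values = (0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9)
regions = {
    "{2,...,8}": range(2, 9),
    "{3,...,7}": range(3, 8),
    "{4,5,6}": range(4, 7),
}

for label, region in regions.items():
    alpha = type_i_error(region, n, 0.5)
    betas = " ".join(f"{type_ii_error(region, n, p):.3f}" for p in p_values)
    print(f"Acceptance Region {label}: alpha = {alpha:.3f}; beta = {betas}")
```

Each printed row corresponds to one block of the comparison table: widening the region lowers α and raises every β, and narrowing it does the opposite.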
The sampling distribution of z when H0 is true: The Standard Normal distribution
[Figure: standard normal curve centred at 0; Accept H0 for z near 0, Reject H0 in either tail]

The Acceptance region:
[Figure: standard normal curve with area α/2 in each tail; Reject H0 below $-z_{\alpha/2}$, Accept H0 between $-z_{\alpha/2}$ and $z_{\alpha/2}$, Reject H0 above $z_{\alpha/2}$]
$$P[\text{Accept } H_0 \text{ when true}] = P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$
$$P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$
• Acceptance Region
– Accept H0 if: $-z_{\alpha/2} \le z \le z_{\alpha/2}$
• Critical Region
– Reject H0 if: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
• With this Choice
$$P[\text{Type I Error}] = P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$

Summary
To Test for a binomial probability p
H0: p = p0 (some specified value of p)
Against
HA: p ≠ p0
we
1. Decide on α = P[Type I Error] = the significance level of the test (usual choices 0.05 or 0.01)
2. Collect the data
3. Compute the test statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$
4. Make the Decision
• Accept H0 if: $-z_{\alpha/2} \le z \le z_{\alpha/2}$
• Reject H0 if: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

Example
• In the last election the proportion of the voters who voted for the Liberal party was 0.08 (8%)
• The party is interested in determining if that percentage has changed
• A sample of n = 800 voters is surveyed
We want to test
– H0: p = 0.08 (8%)
Against
– HA: p ≠ 0.08 (8%)

Summary
1. Decide on α = P[Type I Error] = the significance level of the test
Choose α = 0.05
2. Collect the data
• The number in the sample that support the Liberal party is x = 92
$$\hat{p} = \frac{x}{n} = \frac{92}{800} = 0.115 \ (11.5\%)$$
3. Compute the test statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} = \frac{0.115 - 0.08}{\sqrt{\dfrac{0.08(1-0.08)}{800}}} = 3.649$$
4. Make the Decision
$z_{\alpha/2} = z_{0.025} = 1.960$
• Accept H0 if: $-1.960 \le z \le 1.960$
• Reject H0 if: $z < -1.960$ or $z > 1.960$
Since the test statistic is in the Critical Region we decide to Reject H0
Conclude that H0: p = 0.08 (8%) is false
There is a significant difference (α = 5%) between the proportion of the voters supporting the Liberal party in this election and in the last election

The one tailed z-test
• A success-failure experiment has been repeated n times
• The probability of success p is unknown.
We want to test
– H0: p ≤ p0 (some specified value of p)
Against
– HA: p > p0
• The alternative hypothesis is in this case called a one-sided alternative

The Test Statistic
• To decide to accept or reject the Null Hypothesis (H0) we will use the test statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}}$$
• If H0 is true we should expect the test statistic z to be close to zero or negative
• If p = p0 we should expect the test statistic z to have a standard normal distribution.
• If HA is true we should expect the test statistic z to be a positive number.

The sampling distribution of z when p = p0: The Standard Normal distribution
[Figure: standard normal curve centred at 0; Accept H0 to the left, Reject H0 in the upper tail]

The Acceptance and Critical region:
[Figure: standard normal curve with area α in the upper tail; Accept H0 below $z_\alpha$, Reject H0 above $z_\alpha$]
$$P[\text{Accept } H_0 \text{ when true}] = P[z \le z_\alpha] = 1 - \alpha$$
$$P[\text{Reject } H_0 \text{ when true}] = P[z > z_\alpha] = \alpha$$
• Acceptance Region
– Accept H0 if: $z \le z_\alpha$
• Critical Region
– Reject H0 if: $z > z_\alpha$
• The Critical Region is called one-tailed
• With this Choice
$$P[\text{Type I Error}] = P[\text{Reject } H_0 \text{ when true}] = P[z > z_\alpha] = \alpha$$

Example
• A new surgical procedure is developed for correcting heart defects in infants before the age of one month.
• Previously the procedure was used on infants that were older than one month and the success rate was 91%
• A study is conducted to determine if the success rate of the new procedure is greater than 91% (n = 200)
We want to test
– H0: p ≤ 0.91 (91%)
Against
– HA: p > 0.91 (91%)
where p = the success rate of the new procedure

Summary
1. Decide on α = P[Type I Error] = the significance level of the test
Choose α = 0.05
2. Collect the data
• The number of successful operations in the sample of 200 cases is x = 187
$$\hat{p} = \frac{x}{n} = \frac{187}{200} = 0.935 \ (93.5\%)$$
3. Compute the test statistic
$$z = \frac{\hat{p} - p_0}{\sigma_{\hat{p}}} = \frac{\hat{p} - p_0}{\sqrt{\dfrac{p_0(1-p_0)}{n}}} = \frac{0.935 - 0.91}{\sqrt{\dfrac{0.91(1-0.91)}{200}}} = 1.235$$
4. Make the Decision
$z_\alpha = z_{0.05} = 1.645$
• Accept H0 if: $z \le 1.645$
• Reject H0 if: $z > 1.645$
Since the test statistic is in the Acceptance Region we decide to Accept H0
Conclude that H0: p ≤ 0.91 (91%) is true
There is no significant (α = 5%) increase in the success rate of the new procedure over the older procedure

Comments
• When the decision is made to accept H0, it should not be concluded that we have proven H0.
• This is because when setting up the test we have not controlled β = P[type II error] = P[accepting H0 when H0 is FALSE]
• Whenever H0 is accepted there is a possibility that a type II error has been made.

In the last example
The conclusion that there is no significant (α = 5%) increase in the success rate of the new procedure over the older procedure should be interpreted as:
We have been unable to prove that the new procedure is better than the old procedure

An analogy – a jury trial
The two possible decisions are
– Declare the accused innocent.
– Declare the accused guilty.
The null hypothesis (H0) – the accused is innocent
The alternative hypothesis (HA) – the accused is guilty
The two possible errors that can be made:
– Declaring an innocent person guilty. (type I error)
– Declaring a guilty person innocent. (type II error)
Note: in this case one type of error may be considered more serious
Requiring all 12 jurors to support a guilty verdict:
– Ensures that the probability of a type I error (declaring an innocent person guilty) is small.
– However the probability of a type II error (declaring a guilty person innocent) could be large.
Hence: When a decision of innocence is made:
– It is not concluded that innocence has been proven, but that
– we have been unable to disprove innocence

The z-test for the Mean of a Normal Population

Situation
• Let $\mu$ denote the mean of a normal population
• The mean $\mu$ is unknown.
We want to test
– H0: $\mu = \mu_0$ (some specified value of $\mu$)
Against
– HA: $\mu \ne \mu_0$

The Data
• Let $x_1, x_2, x_3, \ldots, x_n$ denote a sample from a normal population with mean $\mu$ and standard deviation $\sigma$.
• Let
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \text{the sample mean}$$
• We want to test if the mean, $\mu$, is equal to some given value $\mu_0$.
• Obviously if the sample mean is close to $\mu_0$ the Null Hypothesis should be accepted, otherwise the Null Hypothesis should be rejected.

The Test Statistic
• To decide to accept or reject the Null Hypothesis (H0) we will use the test statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma} \approx \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{s}$$
• If H0 is true we should expect the test statistic z to be close to zero.
• If H0 is true we should expect the test statistic z to have a standard normal distribution.
• If HA is true we should expect the test statistic z to be different from zero.

The sampling distribution of z when H0 is true: The Standard Normal distribution
[Figure: standard normal curve centred at 0; Accept H0 for z near 0, Reject H0 in either tail]

The Acceptance region:
[Figure: standard normal curve with area α/2 in each tail; Reject H0 below $-z_{\alpha/2}$, Accept H0 between $-z_{\alpha/2}$ and $z_{\alpha/2}$, Reject H0 above $z_{\alpha/2}$]
$$P[\text{Accept } H_0 \text{ when true}] = P[-z_{\alpha/2} \le z \le z_{\alpha/2}] = 1 - \alpha$$
$$P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$
• Acceptance Region
– Accept H0 if: $-z_{\alpha/2} \le z \le z_{\alpha/2}$
• Critical Region
– Reject H0 if: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$
• With this Choice
$$P[\text{Type I Error}] = P[\text{Reject } H_0 \text{ when true}] = P[z < -z_{\alpha/2} \text{ or } z > z_{\alpha/2}] = \alpha$$

Summary
To Test for the mean $\mu$ of a normal population
H0: $\mu = \mu_0$ (some specified value of $\mu$)
Against
HA: $\mu \ne \mu_0$
1. Decide on α = P[Type I Error] = the significance level of the test (usual choices 0.05 or 0.01)
2. Collect the data
3. Compute the test statistic
$$z = \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma} \approx \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{s}$$
4. Make the Decision
• Accept H0 if: $-z_{\alpha/2} \le z \le z_{\alpha/2}$
• Reject H0 if: $z < -z_{\alpha/2}$ or $z > z_{\alpha/2}$

Example
A manufacturer of Glucosamine capsules claims that each capsule contains on the average:
• 500 mg of glucosamine
To test this claim n = 40 capsules were selected and the amount of glucosamine (X) was measured in each capsule.
Summary statistics:
$\bar{x} = 496.3$ and $s = 8.5$
We want to test:
H0: $\mu = 500$ (Manufacturer's claim is correct)
against
HA: $\mu \ne 500$ (Manufacturer's claim is not correct)

The Test Statistic
$$z = \frac{\bar{x} - \mu_0}{\sigma_{\bar{x}}} = \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{\sigma} \approx \frac{\sqrt{n}\,(\bar{x} - \mu_0)}{s} = \frac{\sqrt{40}\,(496.3 - 500)}{8.5} = -2.75$$

The Critical Region and Acceptance Region
Using α = 0.05
$z_{\alpha/2} = z_{0.025} = 1.960$
We accept H0 if $-1.960 \le z \le 1.960$
We reject H0 if $z < -1.960$ or $z > 1.960$

The Decision
Since z = -2.75 < -1.960
We reject H0
Conclude: the manufacturer's claim is incorrect.
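The z-tests worked through in these slides (the two-tailed test for the Liberal party vote, the one-tailed test for the surgical procedure, and the two-tailed test for the glucosamine capsules) all follow the same four steps, which can be sketched in Python as follows. The function names and the `one_sided` flag are illustrative assumptions, not part of the original slides, and critical values come from `statistics.NormalDist` in place of the table.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(alpha):
    """Upper-alpha critical value of the standard normal: P[Z > z] = alpha."""
    return NormalDist().inv_cdf(1 - alpha)


def z_test_proportion(x, n, p0, alpha=0.05, one_sided=False):
    """Return (z, reject) for H0: p = p0, or H0: p <= p0 when one_sided is True."""
    p_hat = x / n
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    if one_sided:
        reject = z > z_critical(alpha)  # critical region: upper tail only
    else:
        reject = abs(z) > z_critical(alpha / 2)  # critical region: both tails
    return z, reject


def z_test_mean(xbar, mu0, s, n, alpha=0.05):
    """Return (z, reject) for H0: mu = mu0 against HA: mu != mu0, with s used for sigma."""
    z = sqrt(n) * (xbar - mu0) / s
    return z, abs(z) > z_critical(alpha / 2)


# Liberal party example: x = 92 of n = 800, p0 = 0.08, two-tailed
z_vote, reject_vote = z_test_proportion(92, 800, 0.08)

# Surgical procedure example: x = 187 of n = 200, p0 = 0.91, one-tailed
z_surgery, reject_surgery = z_test_proportion(187, 200, 0.91, one_sided=True)

# Glucosamine example: xbar = 496.3, mu0 = 500, s = 8.5, n = 40, two-tailed
z_caps, reject_caps = z_test_mean(496.3, 500, 8.5, 40)

for name, z, reject in [
    ("vote", z_vote, reject_vote),
    ("surgery", z_surgery, reject_surgery),
    ("capsules", z_caps, reject_caps),
]:
    print(f"{name}: z = {z:.3f}, {'Reject' if reject else 'Accept'} H0")
```

The decisions agree with the slides: H0 is rejected for the vote and capsule examples and accepted for the surgical procedure.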