Lecture note 7
Sampling Distribution, Estimation, Hypothesis Testing

Sampling distribution

Let $X_1, X_2, \ldots, X_n$ be independent random variables. Further assume that all the random variables have the normal distribution with mean μ and variance σ², i.e., N(μ, σ²). The sample mean of these random variables is defined as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Thus, the sample mean is a linear combination of n random variables. Since it is a linear combination of random variables, the sample mean is itself a random variable. Then, what is the distribution of the sample mean? I will follow 3 steps to show you the distribution of the sample mean.

Step 1: Notice that the expectation of the sample mean is μ. This is because:

$$
\begin{aligned}
E[\bar{X}] &= E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] && (1)\\
&= \frac{1}{n}\sum_{i=1}^{n} E[X_i] && (2)\\
&= \frac{1}{n}\sum_{i=1}^{n} \mu = \mu && (3)
\end{aligned}
$$

To understand line (2), consider the simplest case where there are only two random variables. Then

$$\bar{X} = \frac{1}{2}X_1 + \frac{1}{2}X_2$$

Thus

$$E[\bar{X}] = \frac{1}{2}E[X_1] + \frac{1}{2}E[X_2] = \frac{1}{2}\sum_{i=1}^{2} E[X_i]$$

Equation (2) is a simple extension of this exercise to the case where there are n random variables. Line (3) follows from the assumption that all n random variables follow N(μ, σ²) and thus have identical mean μ. Therefore, the expectation of the sample mean is equal to μ. Noting this is the first step.

Step 2: Note that the variance of the sample mean is given by

$$
\begin{aligned}
\mathrm{Var}[\bar{X}] &= \mathrm{Var}\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] && (1)\\
&= \frac{1}{n^2}\sum_{i=1}^{n} \mathrm{Var}[X_i] && (2)\\
&= \frac{1}{n^2}\sum_{i=1}^{n} \sigma^2 && (3)\\
&= \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}
\end{aligned}
$$

To understand line (2), consider the simplest case where there are only two random variables. Then

$$\bar{X} = \frac{1}{2}X_1 + \frac{1}{2}X_2$$

Thus

$$\mathrm{Var}[\bar{X}] = \left(\frac{1}{2}\right)^2 \mathrm{Var}[X_1] + \left(\frac{1}{2}\right)^2 \mathrm{Var}[X_2] + 2\cdot\frac{1}{2}\cdot\frac{1}{2}\,\mathrm{Cov}(X_1, X_2)$$

Since $X_1$ and $X_2$ are independent by assumption, $\mathrm{Cov}(X_1, X_2) = 0$. This means that

$$\mathrm{Var}[\bar{X}] = \frac{1}{2^2}\mathrm{Var}[X_1] + \frac{1}{2^2}\mathrm{Var}[X_2] = \frac{1}{2^2}\sum_{i=1}^{2} \mathrm{Var}[X_i]$$

Line (2) is a simple extension of this to the case of n random variables.
Line (3) follows from the assumption that all n random variables follow N(μ, σ²) and thus have identical variance σ². Thus, the variance of the sample mean is σ²/n. Noting this is the second step.

Step 3: In this step, we use the fact that a linear combination of independent normal random variables is also a normal random variable. Since

$$\bar{X} = \frac{1}{n}X_1 + \frac{1}{n}X_2 + \cdots + \frac{1}{n}X_n$$

the sample mean is a linear combination of normal random variables. Therefore, the sample mean $\bar{X}$ is a normal random variable.

Combining the results from step 1 and step 2, the distribution of the sample mean is

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

The following summarizes this finding.

The distribution of the sample mean

Let $X_1, X_2, \ldots, X_n$ be independent random variables with identical distribution N(μ, σ²). Define the sample mean of these random variables as

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

Then the sample mean follows

$$\bar{X} \sim N(\mu, \sigma^2/n)$$

Exercise 1

Let $X_1, X_2, \ldots, X_{64}$ be independent random variables with identical distribution N(0, 1). What is the distribution of the sample mean

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

where n is defined as the number of observations (i.e., the number of variables)?

Estimation

Suppose that the monthly sales of a shop in the past 9 months are given by

Month        Revenue (1000 yen)
April        400
May          200
June         150
July         400
August       100
September     80
October      160
November     150
December     200

In estimation, we consider a data set as random draws from an unknown distribution. For example, we consider that there is an unknown distribution function which characterizes the monthly sales of the store. Then, we consider the data in the table above as realized values of 9 independent draws from this unknown distribution.

In other words, we consider the 9 data points (from April to December) as the realized values of 9 random variables, $X_1, X_2, \ldots, X_9$, which are independently distributed and which have identical distribution. They are independent because they are `random draws’ from the population.
`Identical distribution’ is the assumption.

The purpose of statistics is to estimate a parameter of the unknown distribution, such as the population mean. In this section, I focus on the estimation of the population mean μ.

Point and Interval Estimates

There are two types of estimates. A point estimate is a single number. A confidence interval provides additional information about variability.

[Figure: a confidence interval runs from the lower confidence limit to the upper confidence limit, with the point estimate between them.]

The point estimator of the population mean μ

Let $X_1, X_2, \ldots, X_n$ be the data (n random draws from an unknown distribution). Then the point estimator of the population mean μ is given by

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$

The point estimate of the population mean

We can estimate the unknown population parameter, the population mean μ, by using the sample mean $\bar{x}$ (a point estimate).

Distribution unknown but ‘normal’

The type of distribution that characterizes a data set can be anything. However, in this handout, we consider the case where the distribution is normal with unknown mean and variance. This normal assumption simplifies the analysis.

It is known that, even if the distribution that characterizes the data is not normal, a normal distribution can often be used as a good approximation, especially when the sample is large (by the central limit theorem, the sample mean is then approximately normal). Thus, we focus on the case where the population distribution is normal.

Exercise 2

Month        Sales (1000 yen)
April        400
May          200
June         150
July         400
August       100
September     80
October      160
November     150
December     200

Suppose the monthly sales of a store follow a normal distribution with unknown mean and variance. Compute the point estimate of the population mean. This data is stored in the file ‘Monthly sales data’.

Confidence interval

Confidence interval: An interval that contains the population mean μ with a given probability.

An interval estimate provides more information about the population than does a point estimate. In particular, it can show the uncertainty associated with the point estimate.
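As a quick illustration of the point estimator defined earlier in this section, the sample mean of the monthly sales data from Exercise 2 can be computed as in the sketch below. The variable and function names (`sales`, `point_estimate`) are chosen here for illustration only; the handout itself stores the data in a file rather than in code.

```python
# Monthly sales in 1000 yen, April through December (copied from the table above).
sales = [400, 200, 150, 400, 100, 80, 160, 150, 200]


def point_estimate(data):
    """Return the sample mean, used as the point estimate of the population mean."""
    return sum(data) / len(data)


x_bar = point_estimate(sales)
print(f"n = {len(sales)}, point estimate of mu = {x_bar:.2f} (1000 yen)")
```

The function is just the formula for the sample mean applied to the 9 observations; it says nothing yet about how uncertain that estimate is, which is what the confidence interval addresses.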
An example of a confidence interval

Suppose that $X_1, X_2, \ldots, X_{20}$ are a random sample taken from a population having the normal distribution with variance 4 but unknown mean μ. If the sample mean is 0.5, then the population mean μ is in the interval [-0.3765, 1.3765] with probability 95%.

Proof (important): See the front board.

The interval [-0.3765, 1.3765] is an example of an interval estimate. In particular, the interval in this example is called the 95% confidence interval, since the population mean would fall in this interval with probability equal to 95%. I am 95% confident that μ is between -0.3765 and 1.3765.

Confidence interval estimator

Confidence level: The probability you choose in order to estimate the confidence interval. This is usually set at 95%.

Significance level: A small number α such that the confidence level = 100*(1-α)%. For example, if you set the confidence level at 95%, then the significance level is 0.05.

I explain the construction of the confidence interval for two cases so that you can understand the concept more easily.

Confidence interval for the population mean μ
Case 1: The population variance σ² is known
Case 2: The population variance σ² is unknown

Case 1: Confidence interval of the population mean when the population variance is known

Let the confidence level be 100*(1-α)%. Then define $Z_{\alpha/2}$ as the number satisfying

$$P(Z > Z_{\alpha/2}) = \alpha/2$$

where Z is the standard normal random variable. The definition of $Z_{\alpha/2}$ is illustrated below.

Definition of $Z_{\alpha/2}$

[Figure: standard normal density N(0,1); the right tail beyond $Z_{\alpha/2}$ has area α/2.]

Thus, $Z_{\alpha/2}$ is a cutoff point where the right tail probability is equal to α/2.

Exercise 3

Q1. Find $Z_{\alpha/2}$ when α = 0.05.
Q2. Find $Z_{\alpha/2}$ when α = 0.10.

The confidence interval estimator for the population mean when the population variance is known

Let $X_1, X_2, \ldots, X_n$ be a random draw from the normal distribution with unknown population mean μ but known population variance σ².
Then the 100*(1-α)% confidence interval for the unknown population mean μ is given by

$$\bar{x} - z_{\alpha/2}\frac{\sigma}{\sqrt{n}} \le \mu \le \bar{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}$$

Proof (important): See the front board.

Exercise 4

Month        Sales (1000 yen)
April        400
May          200
June         150
July         400
August       100
September     80
October      160
November     150
December     200

Suppose that the monthly sales of a store follow a normal distribution. Suppose that the population mean is unknown, but the population variance is known to be 10000.

Q1. Find the 95% confidence interval of the population mean.
Q2. Find the 90% confidence interval of the population mean.

Case 2: The confidence interval of the population mean when the population variance is unknown

In case 1, we assumed that the population variance was known. In reality, we rarely know the population variance. Now, in case 2, we consider the situation where the population variance is unknown. Thus it is a more realistic case.

In case 1, the confidence interval was derived from the fact that

$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$

follows N(0,1) when σ is known (see the proof for case 1). For case 2, we replace σ with the sample standard deviation s.

When we replace σ with the sample standard deviation s, we have the following:

$$t = \frac{\bar{X} - \mu}{s/\sqrt{n}} \qquad (A)$$

where the sample standard deviation s is defined as

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (X_i - \bar{X})^2}$$

It is known that (A) has the t-distribution with degrees of freedom equal to n-1.

t-distribution

t-distributions are bell-shaped and symmetric, but have ‘fatter’ tails than the normal.

[Figure: densities of the standard normal (t with df = ∞), t with df = 13, and t with df = 5, all centered at 0.]

A notation

Let $t_v$ be the random variable having the t-distribution with v degrees of freedom. Let 100*(1-α)% be the confidence level. Then the number $t_{v,\alpha/2}$ is defined as the number satisfying

$$P(t_v > t_{v,\alpha/2}) = \alpha/2$$

Definition of $t_{v,\alpha/2}$

[Figure: density of the t-distribution with v degrees of freedom; the right tail beyond $t_{v,\alpha/2}$ has area α/2.]

Thus, $t_{v,\alpha/2}$ is a cutoff point where the right tail probability is equal to α/2.
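The two cutoff points defined so far, $Z_{\alpha/2}$ and $t_{v,\alpha/2}$, can also be computed numerically. The sketch below is one possible way to do so with only the Python standard library, not the method used in the lecture: the normal cutoff uses `statistics.NormalDist`, while the t cutoff is found by integrating the t density with Simpson's rule and bisecting on the tail probability. All function names (`z_cutoff`, `known_variance_ci`, `t_cutoff`, and so on) are illustrative assumptions. As a check, it reproduces the earlier example with n = 20, variance 4 and sample mean 0.5.

```python
import math
from statistics import NormalDist


def z_cutoff(tail_prob):
    """Return c such that P(Z > c) = tail_prob for a standard normal Z."""
    return NormalDist().inv_cdf(1 - tail_prob)


def known_variance_ci(x_bar, sigma, n, alpha):
    """Case 1 interval: x_bar +/- z_{alpha/2} * sigma / sqrt(n)."""
    half_width = z_cutoff(alpha / 2) * sigma / math.sqrt(n)
    return x_bar - half_width, x_bar + half_width


def t_pdf(x, v):
    """Density of the t-distribution with v degrees of freedom."""
    coef = math.gamma((v + 1) / 2) / (math.sqrt(v * math.pi) * math.gamma(v / 2))
    return coef * (1 + x * x / v) ** (-(v + 1) / 2)


def t_right_tail(t, v, steps=2000):
    """P(t_v > t) for t >= 0: Simpson's rule on [0, t], then symmetry about 0."""
    h = t / steps
    total = t_pdf(0.0, v) + t_pdf(t, v)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, v)
    return 0.5 - total * h / 3


def t_cutoff(v, tail_prob):
    """Return c such that P(t_v > c) = tail_prob, for 0 < tail_prob < 0.5."""
    lo, hi = 0.0, 1.0
    while t_right_tail(hi, v) > tail_prob:  # widen until the cutoff is bracketed
        hi *= 2
    for _ in range(60):  # bisection on the decreasing tail probability
        mid = (lo + hi) / 2
        if t_right_tail(mid, v) > tail_prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


# Earlier example: n = 20, variance 4 (so sigma = 2), sample mean 0.5, 95% level.
low, high = known_variance_ci(0.5, 2.0, 20, 0.05)
print(f"95% interval: [{low:.4f}, {high:.4f}]")
print(f"t cutoff for v = 2, tail 0.05: {t_cutoff(2, 0.05):.3f}")
```

In practice a printed table or a statistics package would normally be used for the t cutoff; the numerical integration here only makes the definition $P(t_v > t_{v,\alpha/2}) = \alpha/2$ concrete.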
How to find $t_{v,\alpha/2}$

Cutoff points for Student's t-distribution (column headings are right tail probabilities):

df      .10      .05      .025
1       3.078    6.314    12.706
2       1.886    2.920    4.303
3       1.638    2.353    3.182

See textbook 866, Table 8.

The body of the table contains the cutoff points, not probabilities. For example, if v = 2 and α = 0.1, then α/2 = 0.05 and $t_{2,0.05} = 2.920$.

Exercise 5

Find $t_{v,\alpha/2}$ when v = 40 and α = 0.05.

Case 2: The confidence interval for the population mean when the population variance is unknown

Let $X_1, X_2, \ldots, X_n$ be a random draw from the normal distribution with unknown population mean and unknown population variance. Then the 100*(1-α)% confidence interval for the unknown population mean μ is given by

$$\bar{x} - t_{n-1,\alpha/2}\frac{s}{\sqrt{n}} \le \mu \le \bar{x} + t_{n-1,\alpha/2}\frac{s}{\sqrt{n}}$$

where s is the sample standard deviation.

Proof (important): See the front board.

Exercise 6

Suppose that the monthly sales of a store follow a normal distribution with unknown mean and unknown variance.

Month        Sales (1000 yen)
April        400
May          200
June         150
July         400
August       100
September     80
October      160
November     150
December     200

Q1. Find the 95% confidence interval of the population mean.

Hypothesis testing

Statistical theory provides you with a scientific way to test a hypothesis. Hypothesis testing is an important part of decision making. For example, you can test the following:

1. Whether or not the population mean of the stock return in a particular industry is 8%.
2. Whether a weight reduction program has any real effect.

Null hypothesis and alternative hypothesis

The first step in conducting a hypothesis test is to develop an appropriate (i) null hypothesis and (ii) alternative hypothesis. In the following, I will provide two examples.

[Example 1] If you want to test whether the population mean of the stock return in a particular industry is 8%, then we have two hypotheses:

The null hypothesis H0: μ = 8%
The alternative hypothesis H1: μ ≠ 8%

In hypothesis testing, you test the null hypothesis against the alternative hypothesis.
You should develop an appropriate set of two competing hypotheses.

[Example 2] Suppose you run a weight reduction program. You have the weight reduction data for each client. Then you may want to test whether your weight reduction program has any real effect. Let μ be the unknown population mean of the weight reduction. Then the appropriate null and alternative hypotheses would be

The null hypothesis H0: μ = 0
The alternative hypothesis H1: μ > 0

In this case, your null hypothesis means “the weight reduction program has no effect”, and the alternative hypothesis means “there are some real effects”.

Two sided test

A two sided test has the following null and alternative hypotheses.

H0: μ = μ0
H1: μ ≠ μ0

Thus, the example of the stock return is an example of a two sided test where μ0 = 8%.

Test procedure for two sided test

1. First, set the significance level α. This number should be reasonably small. It is usually set at 0.05.
2. Second, compute the “t-statistic”, which is defined as

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

where $\mu_0$ comes from the null hypothesis.
3. Reject H0 if the absolute value of the t-statistic is greater than $t_{n-1,\alpha/2}$. Otherwise, do not reject (i.e., accept) H0.

Two sided test decision

[Figure: t-distribution with n-1 degrees of freedom; the central region between $-t_{n-1,\alpha/2}$ and $t_{n-1,\alpha/2}$ has area 1-α, and each shaded tail has area α/2.]

Reject H0 if the t-statistic falls in the shaded region. This region is called the rejection region. You reject H0 because, if H0 is true, the t-statistic falls in this region only with a small probability: 100*α%. Do not reject (i.e., accept) H0 if the t-statistic falls in the interval $[-t_{n-1,\alpha/2},\ t_{n-1,\alpha/2}]$.

If the null hypothesis is rejected, you say that the null hypothesis is rejected at the 100*α% significance level. If you rejected H0, the alternative hypothesis (H1) is “accepted”. If you did not reject H0, the null hypothesis (H0) is “accepted”.

Note

Strictly speaking, this test is valid only for the case where the population distribution is normal with unknown mean and unknown variance. If the population distribution is not normal, this does not apply.
However, even if the population distribution is not normal, it is known that this test is a good approximation for a wide range of distributions when the sample is large.

Exercise 7

The Excel file `Test Score’ shows the final exam scores for a particular class. The professor wanted the mean score for the final exam to be about 60. Test whether the population mean of the test scores is equal to 60 at the 5% significance level.

One sided test (upper tail test)

A one sided test (upper tail test) has the following null and alternative hypotheses.

H0: μ = μ0
H1: μ > μ0

Thus, the example of the weight reduction program is an example of the one sided test.

Test procedure for one sided test

1. First, set the significance level α. This is usually set at 0.05.
2. Second, compute the “t-statistic”, which is defined as

$$t = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}$$

where $\mu_0$ comes from the null hypothesis.
3. Reject H0 if the t-statistic is greater than the critical value $t_{n-1,\alpha}$. Otherwise, do not reject (i.e., accept) H0.

One sided test decision

[Figure: t-distribution with n-1 degrees of freedom; the region to the left of $t_{n-1,\alpha}$ has area 1-α, and the shaded right tail has area α.]

Reject H0 if the t-statistic falls in the shaded region. This region is called the rejection region. You reject H0 since, if H0 is true, the t-statistic falls in this region only with a small probability: 100*α%. Do not reject (i.e., accept) H0 if the t-statistic is smaller than $t_{n-1,\alpha}$.

Exercise 8

You run a weight reduction program. The data below show the weight reduction for each of your clients in kilograms. A positive number means a reduction in weight. A negative number means an increase in weight.

Weight reduction in kg: 5, 0, 4, 3, -0.5, 1, 0, 3, 5, 5, 0, 0

Test, at the 5% significance level, whether this program has indeed reduced the weights of your clients. The data is stored in the file `Weight Reduction’.

Testing the difference in the population means between two different samples

We are often interested to see if there are any differences between two groups (female vs. male, etc.).

In this section, I will show how to examine if there is any difference in the population means of two different groups.
Assumptions

(i) The distributions for both groups are normal.
(ii) The population means may be different, but the population variances are the same between the two groups.

Suppose that you have two sets of data, one for group X and the other for group Y. Then you may be interested in the following tests.

One sided test
H0: μx - μy = 0
H1: μx - μy > 0

Example:
H0: Female students (X) and male students (Y) have the same average test score.
H1: Female students have a higher average test score.

Two sided test
H0: μx - μy = 0
H1: μx - μy ≠ 0

Testing procedure

I’m going to describe only the following one-sided test.

H0: μx - μy = 0
H1: μx - μy > 0

Let $n_x$ and $n_y$ be the number of observations for each group. Let $S_x$ and $S_y$ be the sample standard deviations for group X and group Y.

1. Construct the `pooled sample variance’ as

$$S_p^2 = \frac{(n_x - 1)S_x^2 + (n_y - 1)S_y^2}{n_x + n_y - 2}$$

2. Then construct the t-statistic as

$$t = \frac{(\bar{X} - \bar{Y}) - 0}{\sqrt{\dfrac{S_p^2}{n_x} + \dfrac{S_p^2}{n_y}}}$$

The 0 in the numerator is written just to emphasize that it comes from H0: μx - μy = 0. The t-statistic then follows the t-distribution with ($n_x + n_y - 2$) degrees of freedom.

3. Reject the null hypothesis H0 if the t-statistic is greater than $t_{n_x+n_y-2,\alpha}$. Otherwise, do not reject (i.e., accept) H0.

Exercise 9

The file `Test scores’ shows the test scores for a final exam for a particular class. This file also contains information about the students, such as gender or the use of office hours. Answer the following questions.

Q1. Do students who make good use of office hours perform better than those who have never used office hours? Test this at the 5% significance level.
Q2. Is there any difference in test scores between male students and female students? Test this at the 5% significance level.
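The pooled-variance testing procedure described above can be sketched in a few lines of Python. This is a minimal illustration under the stated assumptions (both groups normal, equal population variances), not a solution to Exercise 9: the two score lists are hypothetical numbers invented only to exercise the formulas, and the function name `pooled_t_statistic` is likewise an assumption made here.

```python
import math
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (t, df) for testing H0: mu_x - mu_y = 0 with a pooled variance."""
    nx, ny = len(x), len(y)
    # statistics.variance is the sample variance (divisor n - 1), i.e. S_x^2 and S_y^2.
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    # The "- 0" is the hypothesized difference under H0.
    t = ((mean(x) - mean(y)) - 0) / math.sqrt(sp2 / nx + sp2 / ny)
    return t, nx + ny - 2


# Hypothetical test scores for two groups (not taken from the `Test scores` file).
group_x = [72, 65, 80, 58, 77]
group_y = [60, 55, 70, 62, 53]

t_stat, df = pooled_t_statistic(group_x, group_y)
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

For the one-sided test, the printed t would then be compared with the cutoff $t_{n_x+n_y-2,\alpha}$ from the t table, and H0 rejected if t is larger.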