Chapter 8: Fundamental Sampling Distributions and Data Descriptions

[Diagram: Population → Take Sample → Sample → Inference → Population]

We have spent the last three chapters discussing quantities in the population (PDFs, E(X), Var(X), ...). Now we will continue talking about population quantities, but also discuss how to take a sample and how to summarize the sample itself. Understanding this chapter is key to understanding how we make inferences from the sample to the population!

From this chapter, it is important to learn the following:
- Understand all components of the diagram above. While we will not necessarily be doing the inference part in this chapter, you will get a basic idea of how it is done.
- Sample mean and variance
- Sampling distributions – what they are, how they can be used, and the central limit theorem
- Chi-squared PDF and t-PDF

2005 Christopher R. Bilder

8.1-8.2: Random Sampling and Some Important Statistics

Below is part of the notes from Chapter 1. Parts have been added to update the notes relative to what we have discussed so far this semester.

Most of what we will be doing in this course centers around trying to understand a set of information (data). This set of information comes from ALL objects in a population. Often, this set is so big that obtaining all of the information about these objects is extremely difficult, or the information may simply be too hard to obtain in full.

To understand the population, we usually take a representative subset of information from the population. The subset is a smaller, usually more manageable, set of information. The objects from the population used in this subset form the sample. Typically, we want the objects in the sample to be very similar (representative) to the objects in the population.

Example: Suppose we are interested in estimating the average GPA of all students at UNL. How would we do this?
Assume we do not have access to any student records.

1. Define the population of interest

The population is all UNL students. Problems obtaining information on all of the students: there are more than 20,000 students, and students drop out and enroll late.

2. Define a characteristic or random variable of interest

Let X denote GPA.

3. Define the parameter of interest

Parameter: a numerical summary measure used to describe a population characteristic.

The parameter here is the population average. The average is often called the "mean" and denoted by the Greek letter "mu", μ. There could be other parameters of interest as well; for example, the variance or the standard deviation.

[Diagram: a population of GPA values, such as 3.6, 2.4, 2.7, 2.8, 3.9, 3.4, 1.2, 2.9, 3.2, 4.0, ...]

To find the average GPA, we could add up ALL of the GPAs of students in the population and divide by the total number of students. Why would this be difficult to do?

4. The role of PDFs

By assuming that the student GPAs follow a particular PDF, we can often simplify the problem to some respect. For example, suppose we could assume the GPA PDF is as plotted below:

[Figure: GPA population probability distribution example – f(x) plotted against x (GPA) from 0 to 4]

Note that the above PDF is for a random variable X = 4Y, where Y has a Beta PDF with α = 5 and β = 2 (see #12 on p. 175). The PDF of Y is

    f(y) = [Γ(α+β)/(Γ(α)Γ(β))] y^(α-1) (1-y)^(β-1)   for 0 < y < 1
    f(y) = 0   otherwise

By making this assumption about the PDF, we can somewhat simplify the problem of examining the GPAs. For example, we can find E(X) = μ and Var(X) = σ² using properties discussed in Chapter 4. Note that for this PDF,

    E(X) = E(4Y) = 4α/(α+β) = 4·5/(5+2) = 20/7 = 2.8571

and

    Var(X) = Var(4Y) = 4²αβ/[(α+β+1)(α+β)²] = 4²·5·2/[(5+2+1)(5+2)²] = 0.4082.

Questions: How realistic is this PDF assumption? How could we check whether this is a reasonable assumption? What can we do if this assumption is not correct?
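As a check of the E(X) and Var(X) values above, here is a short Python sketch (the notes themselves use Excel and Maple; this translation is mine). It evaluates the closed-form Beta moments and cross-checks E(X) by numerically integrating 4y·f(y).

```python
from math import gamma

# GPA population: X = 4Y with Y ~ Beta(alpha = 5, beta = 2)
alpha, beta = 5, 2

def f_y(y):
    """Beta(alpha, beta) PDF of Y."""
    return gamma(alpha + beta) / (gamma(alpha) * gamma(beta)) \
        * y**(alpha - 1) * (1 - y)**(beta - 1)

# Closed-form moments, as derived in the notes
mu = 4 * alpha / (alpha + beta)                                       # E(X) = E(4Y)
var = 4**2 * alpha * beta / ((alpha + beta + 1) * (alpha + beta)**2)  # Var(X) = 16 Var(Y)

# Cross-check E(X) by midpoint-rule integration of 4y f(y) over (0, 1)
m = 10_000
mu_num = sum(4 * ((i + 0.5) / m) * f_y((i + 0.5) / m) for i in range(m)) / m

print(round(mu, 4), round(var, 4))   # 2.8571 0.4082
```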
In real-life applications, parameters such as μ, σ², α, and β are never really known. Thus, we need to take a sample in order to estimate them.

5. Define the sample

Suppose a representative sample of 20 students is taken from the population. Each student's GPA is a random variable denoted by X. The particular students chosen from the population have random variable GPAs denoted by X1, X2, ..., X20. Each of these random variables has the PDF f(xi) for i = 1, ..., 20.

Once we know what the actual GPAs are for these students, say 2.9, 3.4, ..., we have the observed values of the random variables, since we are "observing" particular student GPAs. These observed values are denoted by x1, x2, ..., x20. The observed values are also called observations.

6. Define the statistic

Statistic: a numerical summary measure used to describe a sample characteristic.

Note: a statistic estimates a parameter.

The statistic here is the sample mean, which estimates the population mean, μ. The sample mean is

    X̄ = (Σ_{i=1}^{20} Xi)/20.

When the actual GPAs are observed, then

    x̄ = (Σ_{i=1}^{20} xi)/20

is a symbolic way to denote an actual numerical value. Please remember the discussion in Chapter 3 about random variables (X – denoted by capital letters) and their observed values (x – denoted by lowercase letters). I know that this can be confusing!

There are other statistics we may want to calculate as well. For example, the sample variance is

    S² = Σ_{i=1}^{20} (Xi - X̄)² / (20 - 1)

and it estimates the population variance, σ². There are also statistics which estimate α and β. One way to derive these statistics is through maximum likelihood estimation, which is the subject of Section 9.15.

7. Random sample

How do we take the sample?

Random sample: select n items from a population where each has an equal chance of being chosen.

There are other ways to take the sample, but each of them aims to be representative of the population.
[Diagram: Population (3.6, 2.4, 2.7, 2.8, 3.4, 1.2, 2.9, 3.6, 3.2, 4.0, ...) → Take Sample → Sample (2.9, 3.4, 3.9, 2.8)]

In the above diagram, I could not fit all >20,000 GPAs in the population and all 20 GPAs in the sample!

By taking a random sample, we ensure that X1 is independent of all the other Xi's. Similarly, X2 is independent of all of the other Xi's, and so on. Using the independence property discussed in Section 3.4, the joint PDF for X1, X2, ..., X20 is

    f(x1, x2, ..., x20) = f(x1)f(x2)f(x3)···f(x20).

In terms of the GPA problem, what does it mean for X1 to be independent of X2, X2 to be independent of X3, and so on?

8. Inference

Inference: a deduction or conclusion about the population based on the sample.

Based upon the statistic in the sample, we will make inferences about the parameter in the population with a certain level of accuracy. This level of accuracy can be quantified through the use of probability. We will begin to discuss how this is done in this chapter!

Questions:
- The sample mean GPA, X̄, estimates the population mean GPA, μ. Is the sample mean GPA equal to the population mean GPA?
- How accurate is the sample mean GPA in estimating the population mean GPA? Statistical science allows us to measure this accuracy; in this chapter we will start to learn how.
- Suppose another random sample is taken. Is the sample mean GPA going to be the same? Is the sample mean a random variable?
- What would happen if a random sample is not taken? Suppose only College of Business students are sampled.

Important statistics and definitions

Definition 8.4: Any function of the random variables constituting a random sample is called a statistic.
Definition 8.5: If X1, X2, ..., Xn represent a random sample of size n, then the sample mean is defined by the statistic

    X̄ = (Σ_{i=1}^{n} Xi)/n

Notes:
- This statistic estimates the population mean, μ.
- It measures the "central" location of all possible values of the random variable. Other measures of central location include the median (50% of all values are less than it and 50% are greater than it) and the mode (the most frequent value).

Definition 8.6: If X1, X2, ..., Xn represent a random sample of size n, then the sample variance is defined by the statistic

    S² = Σ_{i=1}^{n} (Xi - X̄)² / (n - 1)

Notes:
- This statistic estimates the population variance, σ².
- The sample standard deviation, S, is the positive square root of S².
- See p. 199 for a re-expression of the formula (I recommend against using it due to numerical inaccuracies which can occur when using it).
- Where does this formula come from?
  o Remember that σ² = E[(X-μ)²]. Put into words, this is the expected (average) squared deviation of the random variable X from the population mean.
  o Notice that the (Xi - X̄)² part plays the role of (X-μ)².
  o The Σ_{i=1}^{n} (·)/(n-1) part plays the role of E[·]. Remember that in the discrete random variable case, we sum over ALL possible values of X to find E[·]. For the sample, we sum over all values observed. Since the "expected value" is what we would expect on average to happen (remember that f(x) is like a weight), we divide by n-1 in S² to find the average squared deviation. The reason for dividing by n-1 instead of n will be discussed in Section 9.3.
- I usually will put a problem on an exam which asks you to compute this quantity by hand.

Side note: all or most observed values should fall within 2 to 3 standard deviations of the mean, i.e., within X̄ ± 2S or X̄ ± 3S.

Example: Sample mean and variance GPA (sample_mean_var_GPA.xls)

Below is an actual sample of GPAs where the PDF is defined on p. 8.5.
Thus, x1 = 1.656, x2 = 1.417, ..., x20 = 3.375.

GPAs: 1.656, 1.417, 2.810, 3.328, 3.745, 3.325, 3.338, 2.899, 3.549, 3.426,
      2.726, 2.186, 3.044, 3.385, 3.678, 2.721, 3.351, 3.069, 2.424, 3.375

Summary (calculated both directly and with Excel's built-in functions):

    Sample size       20
    Sample mean       2.973
    Sample variance   0.4070
    Sample s.d.       0.6380

    Rule of thumb     lower   upper
    2 s.d.            1.697   4.249
    3 s.d.            1.059   4.886

By hand:

    x̄ = (1.656 + 1.417 + ... + 3.375)/20 = 2.973

    s² = Σ_{i=1}^{n} (xi - x̄)²/(n-1)
       = [(1.656 - 2.973)² + ... + (3.375 - 2.973)²]/(20 - 1)
       = 0.4070

    s = √(s²) = √0.4070 = 0.6380

Rule of thumb with 2 standard deviations:

    x̄ - 2s = 2.973 - 2(0.6380) = 1.697
    x̄ + 2s = 2.973 + 2(0.6380) = 4.249

Examine how the above range for GPAs corresponds to the plot on p. 8.5. Of course, GPAs cannot be greater than 4. Also, notice that 18 of 20 observations fall within x̄ ± 2s and 20 of 20 observations fall within x̄ ± 3s.

A screen capture of the Excel formulas used to calculate these quantities is in the file. Note that Chris Malone's Excel Instructions website contains help on some of these functions. For example, http://www.statsteacher.com/excel/analyses/mean.html contains information about the AVERAGE() function.

8.3: Data Displays and Graphical Methods

This section describes ways to view the observed values from the sample graphically. This helps in understanding the distribution, mean, variance, and other summary measures calculated from the observed values.

Histograms

A histogram is a plot of the frequency distribution of the observed values in the sample.

Example: Sample mean and variance GPA (sample_mean_var_GPA.xls)

    Class   Frequency
    0        0
    0.5      0
    1        0
    1.5      1
    2        1
    2.5      2
    3        4
    3.5      9
    4        3
    More     0

Be very careful when interpreting these classes. For example, "Class = 4" means >3.5 and ≤4. Also, "Class = 3.5" means >3 and ≤3.5.
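The Excel calculations above can be sketched in Python as well (my translation, not part of the notes' files):

```python
import statistics

# The GPA sample from the notes
gpas = [1.656, 1.417, 2.810, 3.328, 3.745, 3.325, 3.338, 2.899, 3.549, 3.426,
        2.726, 2.186, 3.044, 3.385, 3.678, 2.721, 3.351, 3.069, 2.424, 3.375]

xbar = statistics.mean(gpas)    # sample mean
s2 = statistics.variance(gpas)  # divides by n-1, as in Definition 8.6
s = statistics.stdev(gpas)      # sample standard deviation

print(round(xbar, 3), round(s2, 4), round(s, 4))

# Rule-of-thumb range: almost all observations should fall in xbar +/- 2s
print(round(xbar - 2*s, 3), round(xbar + 2*s, 3))

# Histogram counts matching Excel's bin convention: bin b counts values in (b-0.5, b]
edges = [0.5 * k for k in range(9)]  # 0, 0.5, ..., 4
counts = [sum(1 for x in gpas if lo < x <= hi)
          for lo, hi in zip(edges[:-1], edges[1:])]
print(counts)
```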
[Figure: Histogram for GPA sample – frequency (0 to 10) vs. GPA classes 0 to 4 and "More"]

Also, be very careful with lining up these bars with the classes! See my red arrows above for help.

Compare this to the GPA PDF on p. 8.5. Are their shapes similar?

Box plots

There are a few different ways to draw these types of plots. Below is an example set of box plots for four different samples (not necessarily from the same population) to be used for our definition of a box plot.

Notes:
- The sample 25th percentile is the value such that approximately 25 percent of the observed values are below it and 75 percent are above it. This value is often denoted as Q1. See p. 204 for more information on its exact calculation. The sample median (50th percentile) and the sample 75th percentile, Q3, are similarly defined.
- The "box" in the middle of each box plot shows the range of the middle 50 percent of the observed values.
- The sample mean is also plotted. Notice that it does not necessarily equal the sample median.
- Lines to the left and to the right of the box are drawn out to values as shown above. Most observed values are expected to fall within this range. This serves a similar purpose as the rule of thumb for the number of standard deviations within which all data lie of the mean.
- Observed values outside of the horizontal lines are often called outliers, since they are outside of the range we would expect them to fall within.

Dot plots

Below is an example dot plot. Each plot symbol denotes an observed value. These values are "jittered" in the vertical direction to help avoid overlapping.

Unfortunately, Excel does not have an easy way to create these plots. The box_dot_plot.xls file serves as a "template" for drawing box and dot plots. The file can create box and dot plots for up to four different samples with sample sizes less than 500.
This file was created by Chris Malone (Winona State University) and myself.

Example: Sample mean and variance GPA (sample_mean_var_GPA.xls, box_dot_plot.xls, and data_summary.xls)

I copied and pasted the observed values from sample_mean_var_GPA.xls into the DATA sheet of box_dot_plot.xls. I deleted the data which was in columns 2-4. The plots are on the "box and dot plots" sheet.

Notes:
- Notice how the two observed values outside of the X̄ ± 2S range show up as outliers here!
- Compare the dot plot to the histogram on p. 8.18. Be careful about the scale of the x-axis on both plots. Typically, you will want to make them exactly the same so that you can compare the two plots!

Below are the results from using data_summary.xls. Note that the box plot drawn here is done a little differently. No outliers are shown, since the dot plot gives that information. The lines on both sides of the box plot are only drawn out to the limits shown on p. 8.19, or to the smallest or largest value within the limits. Notice the right-side line is drawn only out to the maximum value, while the left-side line is drawn out to the full limits.

Example: Investments (data only contained in invest.xls)

Below are box and dot plots for monthly investment returns from May 1991 to May 2001 for three different stock indexes. For example, a value of 0.10 means a 10% profit was made for a particular month. The different sizes of the outliers do not mean anything (error in file). Which investment has more variability? Which has the larger mean return?

8.4-8.5: Sampling Distributions and the Sampling Distribution of Means

Suppose X1, X2, ..., Xn is a random sample from a population with a possibly unknown PDF. Also, suppose E(Xi) = μ and Var(Xi) = σ² for i = 1, ..., n. What are E(X̄) and Var(X̄)?
    E(X̄) = E[(Σ_{i=1}^{n} Xi)/n] = (1/n) E[Σ_{i=1}^{n} Xi] = (1/n) Σ_{i=1}^{n} E(Xi) = (1/n)(nμ) = μ

    Var(X̄) = Var[(Σ_{i=1}^{n} Xi)/n] = (1/n²) Var[Σ_{i=1}^{n} Xi] = (1/n²) Σ_{i=1}^{n} Var(Xi) = (1/n²)(nσ²) = σ²/n

Notes:
- If it is not clear why the above statements are true, please review Section 4.3 of the notes.
- Let's examine E(X̄) = μ more closely.
  o Suppose a random sample of size n was taken and X̄ was found. Most likely, it will not be exactly equal to μ, but you would expect it to be somewhat close.
  o Suppose another random sample of size n was taken and X̄ was found. Most likely, it will not be exactly equal to μ or to the past X̄, but you would expect it to be somewhat close.
  o Suppose this process of taking random samples of size n and calculating X̄ was repeated an infinite number of times. If you were to find the average value of ALL of these X̄'s, it would be μ.
  o We will examine a computer simulation of this process shortly.
- Let's examine Var(X̄) = σ²/n more closely.
  o The larger the sample size, the SMALLER Var(X̄) becomes. Why?
  o We will examine a computer simulation of why Var(X̄) = σ²/n shortly.
- Often, you will see μ_X̄ used to mean E(X̄) and σ²_X̄ used to mean Var(X̄). This corresponds to the notation introduced in Section 3.4 when we had multiple random variables.

Central Limit Theorem

So far, we have been concerned with the PDF for a random variable X. Remember the example shown in Chapter 6 (p. 6.41) of what can happen if the PDF assumption is wrong. We have examined ways to help determine whether this PDF is reasonable or to fix the assumption (i.e., look at a histogram, change the parameter values of the PDF).

There are other ways to get around making an assumption about the PDF for X. Instead, we can take a sample from a population (without knowing the PDF) and calculate a statistic. By the Central Limit Theorem (CLT), we can use a normal PDF approximation to the statistic's PDF – NO MATTER WHAT THE PDF WAS FOR THE POPULATION!!!
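The "computer simulation" promised above can be sketched in Python (the notes use an Excel file; this translation and the seed are mine): repeatedly sample n = 20 GPAs from X = 4Y with Y ~ Beta(5, 2), and check that the X̄'s average out to μ = 2.8571 with variance close to σ²/n = 0.4082/20 = 0.0204.

```python
import random
import statistics

random.seed(1)   # fixed seed so the run is reproducible
n, reps = 20, 1000

xbars = []
for _ in range(reps):
    sample = [4 * random.betavariate(5, 2) for _ in range(n)]
    xbars.append(statistics.mean(sample))

print(round(statistics.mean(xbars), 3))      # should be close to mu = 2.857
print(round(statistics.variance(xbars), 4))  # should be close to sigma^2/n = 0.0204
```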
The CLT is one of the most important concepts to take from this course!!! Below is the CLT expressed in terms of one statistic, the sample mean.

Central Limit Theorem: If X̄ is the mean of a random sample of size n taken from a population with mean μ and finite variance σ², then the limiting form of the PDF for

    Z = (X̄ - μ)/(σ/√n)

as n→∞ is the standard normal PDF. Equivalently, one can say that X̄ has an approximate normal PDF with mean μ and variance σ²/n for a large sample size.

Questions:

1) What does the CLT mean?
- No matter what the PDF for the Xi (i = 1, ..., n), Z has approximately a standard normal PDF provided the sample size, n, is sufficiently large.
- Probabilities involving X̄ can be found with the normal PDF in a similar way as probabilities were found for one random variable, X, in Chapter 6.
- If the sample size, n, is not sufficiently large, the result still holds exactly if we can assume each Xi has the same normal PDF.

2) How large of a sample size is needed for the CLT to work? This depends on the PDF for the Xi (i = 1, ..., n). As a general rule of thumb, n ≥ 30 should work for most PDFs of the Xi. However, smaller sample sizes may work as well.

3) Why is this formula for Z used? In Chapter 6, we showed that a normal random variable, X, with E(X) = μ and Var(X) = σ² can be transformed to a STANDARD normal random variable, Z, using Z = (X-μ)/σ. Remember that E(X̄) = μ and Var(X̄) = σ²/n. Thus, using the same type of transformation as described in the last bullet, we obtain

    Z = (X̄ - μ)/(σ/√n)

4) Why does X̄ have a PDF? X̄ is a random variable since it is the average of other random variables. Note that X̄ varies from sample to sample to sample. One can quantify how these X̄'s are "distributed" (the possible values they can take on) using a PDF!

5) The PDF of a statistic is often called a sampling distribution since it comes about through taking a sample from a population.
6) There is still one problem with the CLT – you need to know μ and σ². How to get around this problem will be discussed in future sections and chapters.

7) Many, many other statistics can have their PDFs approximated by a CLT. Let Yi = 0 or 1 with E(Yi) = p for i = 1, ..., n, where the Yi are independent (thus, the Yi are Bernoulli random variables). Then

    Ȳ = (Σ_{i=1}^{n} Yi)/n = P̂

is the sample proportion of 1's, or successes. Note that E(Ȳ) = p and Var(Ȳ) = p(1-p)/n. Thus,

    Z = (P̂ - p)/√(p(1-p)/n)

can be approximated by a standard normal PDF. This is the same result as shown in Section 6.5 (divide the expression in that section by n in the numerator and denominator).

Theorem 8.3 – to be discussed later.

Example: UNL GPA 1,000 samples (CLT_GPA_ex.xls)

The following is done in the Excel file:
- 1,000 samples of size 20 are taken from the population with the GPA PDF shown on p. 8.5 (assume n = 20 is large enough for the CLT to hold).
- Each row in the SAMPLES sheet represents a sample of size 20. There are 1,000 rows. Remember that μ = 2.8571 and σ² = 0.4082.
- The sample mean is calculated for each sample, so that there are 1,000 X̄'s.
- A histogram is constructed of the 1,000 sample means to determine whether the PDF of X̄ is approximately normal (as the CLT says it should be). This histogram is found in the HISTOGRAM FOR X_BAR sheet.

[Figure: Simulated distribution of X̄ – histogram of the 1,000 sample means (also drawn with data_summary.xls), with a normal PDF with mean 2.856 and variance 0.0211 overlaid]

- The mean and variance of the 1,000 X̄'s are given to see whether they are approximately equal to E(X̄) = μ and Var(X̄) = σ²/n.
Note that if ALL possible samples of size 20 were taken, the mean and variance of all of the X̄'s would be exactly E(X̄) and Var(X̄). In this case, we are just taking 1,000 samples. For this particular example, it was shown on p. 8.5 that μ = 2.8571 and σ² = 0.4082. Thus, E(X̄) = 2.8571 and Var(X̄) = σ²/n = 0.4082/20 = 0.02041.

Below are the mean and variance of the 1,000 X̄'s. These are calculated on the E(X_BAR) AND VAR(X_BAR) sheet.

    Mean                 2.8560
    Variance             0.0211
    Standard deviation   0.1453
    Min                  2.3472
    Max                  3.3083
    Number of means      1000

The probability of X̄ being between 3 and 4 is approximated from the 1,000 X̄'s; i.e., the proportion of X̄'s between 3 and 4 is found. The proportion is 0.165 (see the use of =(COUNTIF(U17:U1016,"<4")-COUNTIF(U17:U1016,"<3"))/1000 in the SAMPLES sheet).

The CLT says the probability that X̄ is between 3 and 4 can be approximated by a normal distribution with mean μ and variance σ²/n. P(3 < X̄ < 4) = P(X̄ < 4) - P(X̄ < 3) can be found with the Excel function:

    NORMDIST(4,2.8571,SQRT(0.4082/20),TRUE) - NORMDIST(3,2.8571,SQRT(0.4082/20),TRUE)

The resulting probability is 0.1586. Equivalently, one can find this probability through

    P(3 < X̄ < 4) = P[(3 - 2.8571)/√(0.4082/20) < (X̄ - μ)/(σ/√n) < (4 - 2.8571)/√(0.4082/20)]
                 = P(1.0003 < Z < 7.9999)
                 = P(Z < 7.9999) - P(Z < 1.0003)
                 = 1 - 0.8414
                 = 0.1586

Transforming the probability to be in terms of a standard normal random variable, Z, used to be necessary when using standard normal distribution tables. This is no longer needed now.

Notice how close the probability resulting from the CLT is to the probability resulting from the simulated PDF of X̄. Thus, the CLT allows us to calculate these probabilities without taking 1,000 samples of size 20, finding the mean, finding the variance, and so on. Of course, taking 1,000 samples of size 20 is not feasible for the vast majority of real-life applications!
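The same normal-approximation probability can be sketched in Python (the notes use Excel's NORMDIST); the standard normal CDF can be written in terms of the error function.

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, var, n = 2.8571, 0.4082, 20
sd_xbar = sqrt(var / n)   # sigma/sqrt(n), the standard deviation of X-bar

p = norm_cdf(4, mu, sd_xbar) - norm_cdf(3, mu, sd_xbar)
print(round(p, 4))   # 0.1586
```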
Since we thoroughly discussed finding probabilities with the normal PDF in Chapter 6, many of the same techniques apply here. Remember, the main advantage of using X̄ this way is that you do not need to know the PDF for X!

Example: Healthy Choice (health_choice.xls)

Healthy Choice claims that it fills boxes on average with 24 oz. of cereal and that the standard deviation is 2 oz. Suppose an FDA official wants to find out whether boxes of Healthy Choice cereal have the advertised weight of 24 oz. The FDA official randomly samples 36 boxes of cereal. Suppose Healthy Choice is making cereal with μ = 24 oz. and σ = 2 oz.

1) What is the approximate probability that the sample mean weight is greater than 23 oz.?

Notice that nothing is said about the PDF for each box here!!!

[Figure: Normal PDF for X̄ with mean 24 and s.d. 0.3333, plotted for X̄ from 22 to 26]

Note that μ = 24 and σ/√n = 2/√36 = 0.3333. Find P(X̄ > 23) = 1 - P(X̄ < 23). The Excel function is

    1 - NORMDIST(23,24,0.3333,TRUE)

The resulting probability is 0.9987.

2) What is the approximate probability that the sample mean weight of the boxes is between 23 and 25 oz.?

Find P(23 < X̄ < 25). The Excel function is

    NORMDIST(25,24,0.3333,TRUE) - NORMDIST(23,24,0.3333,TRUE)

The resulting probability is 0.9973.

3) The company will be fined if the sample mean weight of the boxes is not within 1 oz. of the advertised true mean. What is the approximate probability the company will receive a fine?

Find P(X̄ < 23 or X̄ > 25). This is 1 - P(23 < X̄ < 25) = 1 - 0.9973 = 0.0027.

The following theorem is another way to express the CLT, but for a different statistic.
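Before moving on, the three Healthy Choice probabilities can also be checked with a short Python sketch (again my translation of the NORMDIST calls, not part of the notes' file):

```python
from math import erf, sqrt

def norm_cdf(x, mu=0.0, sigma=1.0):
    """Normal CDF via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma, n = 24, 2, 36
sd_xbar = sigma / sqrt(n)   # 2/6 = 0.3333...

p_gt_23 = 1 - norm_cdf(23, mu, sd_xbar)                          # part 1
p_23_25 = norm_cdf(25, mu, sd_xbar) - norm_cdf(23, mu, sd_xbar)  # part 2
p_fine = 1 - p_23_25                                             # part 3

print(round(p_gt_23, 4), round(p_23_25, 4), round(p_fine, 4))  # 0.9987 0.9973 0.0027
```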
Theorem 8.3: If independent samples of size n1 and n2 are drawn at random from two populations, discrete or continuous, with means μ1 and μ2 and variances σ1² and σ2², respectively, then the sampling distribution of the difference between the sample means, X̄1 - X̄2, is approximately a normal PDF with mean and variance given by

    E(X̄1 - X̄2) = μ1 - μ2   and   Var(X̄1 - X̄2) = σ1²/n1 + σ2²/n2.

Hence,

    Z = [(X̄1 - X̄2) - (μ1 - μ2)] / √(σ1²/n1 + σ2²/n2)

is approximately a standard normal random variable for large n1 and n2. Equivalently, one can say that X̄1 - X̄2 has an approximate normal PDF with mean μ1 - μ2 and variance σ1²/n1 + σ2²/n2 for large samples.

Notes:
- n1 ≥ 30 and n2 ≥ 30 is usually a large enough sample size for the CLT to hold.
- It is often of interest to compare two population means to see which one is larger. This result will be used a lot in Sections 9.8 and 10.8.
- Probabilities can be found using the normal PDF here in a similar manner as when there was only one sample mean. Please see the textbook for examples if you are not sure exactly how.

Question: Why isn't there a covariance term in Var(X̄1 - X̄2) = σ1²/n1 + σ2²/n2?

8.6: Sampling Distribution of S²

The chi-squared PDF is another PDF which is often used in statistics. Below is its definition from Section 6.8:

Chi-squared PDF: The continuous random variable X has a chi-squared PDF, with ν degrees of freedom, if its PDF is given by

    f(x) = [1/(2^(ν/2) Γ(ν/2))] x^(ν/2 - 1) e^(-x/2)   for x > 0
    f(x) = 0   otherwise

where ν is a positive integer.

Mean and variance of a chi-squared random variable: E(X) = μ = ν and Var(X) = σ² = 2ν.

Notes:
- The chi-squared PDF is a gamma PDF with α = ν/2 and β = 2.
- ν is a parameter. Different shapes of the PDF result from different values of ν. The reason why ν is called the "degrees of freedom" will be explained shortly.
- The chi-squared PDF could equivalently be expressed as the χ² PDF.
- We often want to find quantiles or percentiles from this PDF.
For example, the value c which satisfies P(X < c) = 0.95 is called the 0.95 quantile and the 95th percentile of the PDF. Thus, the area to the LEFT of c underneath the curve is 0.95. Symbolically, this c value is typically denoted as χ²_{0.05,ν}, using the notation from our book. Be very careful with this notation, since the area to the RIGHT is given in the subscript.

Example: Chi-squared PDF (chi_square_dist.xls)

chi_square_dist.xls is an interactive file which allows you to see the PDF for different values of the degrees of freedom.

To find a quantile, the Excel function is CHIINV(area to the right, degrees of freedom).

To find probabilities, the Excel function is CHIDIST(x, degrees of freedom), which gives P(X > x). Notice that this probability is the opposite of what most of the other Excel functions discussed so far give for continuous PDFs. For example, GAMMADIST(x, ν/2, 2, TRUE) gives P(X < x).

To find f(x), there is no specific function. Instead, the relationship with the gamma PDF can be used: GAMMADIST(x, ν/2, 2, FALSE) will find f(x) for a specified ν.

Just as a reminder, remember how one would find these probabilities using regular integration! For example, with ν = 10,

    P(X > 15.98717) = ∫_{15.98717}^{∞} [1/(2^{10/2} Γ(10/2))] x^{10/2 - 1} e^{-x/2} dx ≈ 0.1

In Maple,

    > assume(x>0);
    > assume(nu>0,nu::integer);
    > f(x):=1/(2^(nu/2)*GAMMA(nu/2)) * x^(nu/2-1) * exp(-x/2);
    > int(eval(f(x),nu=10),x = 15.98717..infinity);
            .1000002634
    > 1-stats[statevalf, cdf, chisquare[10]](15.98717);
            .1000002634

Table A.5 on p. 674-675 gives some of the probabilities for specified ν.
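The same tail probability can be checked numerically in Python (a sketch of my own, using Simpson's rule instead of Excel's CHIDIST or Maple's int()):

```python
from math import exp, gamma

def chisq_pdf(x, nu):
    """Chi-squared PDF with nu degrees of freedom."""
    return x**(nu/2 - 1) * exp(-x/2) / (2**(nu/2) * gamma(nu/2))

def simpson(f, a, b, m=10_000):
    """Composite Simpson's rule on [a, b] with m (even) subintervals."""
    h = (b - a) / m
    s = f(a) + f(b)
    s += 4 * sum(f(a + (2*k - 1)*h) for k in range(1, m//2 + 1))
    s += 2 * sum(f(a + 2*k*h) for k in range(1, m//2))
    return s * h / 3

nu, c = 10, 15.98717
# P(X > c) = 1 - P(0 < X < c); the pdf is 0 at x = 0 when nu = 10
p = 1 - simpson(lambda x: chisq_pdf(x, nu), 0, c)
print(round(p, 4))   # close to 0.1, matching the Maple output above
```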
Below is part of the table for ν = 10 (each entry is the value c with the given area to its right):

    Area right:  0.995  0.99   0.98   0.975  0.95   0.9    0.8    0.75   0.7    0.5
    c:           2.156  2.558  3.059  3.247  3.940  4.865  6.179  6.737  7.267  9.342

    Area right:  0.3     0.25    0.2     0.1     0.05    0.025   0.02    0.01    0.005   0.001
    c:           11.781  12.549  13.442  15.987  18.307  20.483  21.161  23.209  25.188  29.588

You are not responsible for knowing how to use this table.

In addition to obtaining the PDF for X̄, we may want the PDF for S², the sample variance. In order to do this, we do need to assume that X1, X2, ..., Xn are a random sample from a normal PDF with E(X) = μ and Var(X) = σ². Below is the result.

Theorem 8.4: If S² is the variance of a random sample of size n taken from a normal population having variance σ², then the statistic

    (n-1)S²/σ² = Σ_{i=1}^{n} (Xi - X̄)²/σ²

has a chi-squared PDF with ν = n-1 degrees of freedom.

Proof: Unfortunately, there are important theorems which we skipped in Chapter 7 which are needed here. One of those theorems says that if X1, X2, ..., Xn are normal random variables with the same mean and variance, then Y = X1 + X2 + ... + Xn is also a normal random variable with E(Y) = Σ_{i=1}^{n} E(Xi) = nμ and Var(Y) = Σ_{i=1}^{n} Var(Xi) = nσ² (we had already seen the E( ) and Var( ) parts in Section 4.3). Also, suppose that Zi = (Xi - μ)/σ for i = 1, ..., n. Then each Zi has a standard normal PDF, and Z1 + Z2 + ... + Zn has a normal PDF with mean 0 and variance n.

A second important theorem shows that Zi² has a chi-squared PDF with ν = 1 degree of freedom. Thus, the square of a standard normal random variable is a chi-squared random variable with 1 degree of freedom! And the sum Z1² + Z2² + ... + Zn² has a chi-squared PDF with ν = n degrees of freedom.

Using the above theorems, we can note the following:

    Σ_{i=1}^{n} Zi² = Σ_{i=1}^{n} (Xi - μ)²/σ²   has a chi-squared PDF with ν = n.

Notice that the numerator term Xi - μ can be rewritten as (Xi - X̄) + (X̄ - μ), since X̄ is just being added and subtracted.
Multiplying this out results in

    Σ_{i=1}^{n} (Xi - μ)² = Σ_{i=1}^{n} [(Xi - X̄) + (X̄ - μ)]²
                          = Σ_{i=1}^{n} (Xi - X̄)² + 2(X̄ - μ) Σ_{i=1}^{n} (Xi - X̄) + n(X̄ - μ)²
                          = Σ_{i=1}^{n} (Xi - X̄)² + n(X̄ - μ)²

since Σ_{i=1}^{n} (Xi - X̄) = Σ_{i=1}^{n} Xi - nX̄ = Σ_{i=1}^{n} Xi - Σ_{i=1}^{n} Xi = 0. Then

    Σ_{i=1}^{n} Zi² = Σ_{i=1}^{n} (Xi - μ)²/σ²
                    = Σ_{i=1}^{n} (Xi - X̄)²/σ² + (X̄ - μ)²/(σ²/n)
                    = (n-1)S²/σ² + (X̄ - μ)²/(σ²/n)

The left side has a chi-squared PDF with ν = n degrees of freedom, and (X̄ - μ)²/(σ²/n) has a chi-squared PDF with ν = 1 (since X̄ has a normal PDF with mean μ and variance σ²/n). It follows that (n-1)S²/σ² has a chi-squared PDF with ν = n-1 degrees of freedom.

What are degrees of freedom? Suppose the sum of three numbers is 6. In order to know all three numbers, you only need to know two of them and the sum. For example,

    X1 = 1 (pick)
    X2 = 2 (pick)
    X3 = 3 (cannot vary)
    Sum = 6

The degrees of freedom are 2 for the sum. Similarly, the degrees of freedom are n-1 for X̄ and S². The chi-squared PDF and other PDFs build these values into their PDF formulas as parameters. Typically, the degrees of freedom will always be known, since we know the sample size. Thus, these PDFs can be easier to work with.

8.7: t-distribution

W.S. Gosset story.

Theorem 8.5: Let Z be a standard normal random variable and V a chi-squared random variable with ν degrees of freedom. If Z and V are independent, then the PDF of the random variable

    T = Z/√(V/ν)

is given by

    h(t) = [Γ((ν+1)/2) / (Γ(ν/2)√(πν))] (1 + t²/ν)^{-(ν+1)/2}   for -∞ < t < ∞.

This is known as the t-distribution with ν degrees of freedom.

Proof: See a book which would be used for STAT 463 or STAT 872.

Example: Compare t and standard normal PDFs (t_stand_norm.xls)

As you can see from the plots in the file, the t-distribution is very similar to the standard normal PDF. The main difference is that there is more area underneath the "tails" (ends) of the t-distribution.
As the degrees of freedom, ν, become larger, the difference between the two PDFs becomes extremely small. Often, the standard normal PDF can be used in place of the t-distribution when ν is not small. In fact, for ν equal to infinity the t-distribution IS the standard normal PDF!

Example: Finding probabilities from a t-distribution (t_prob.xls)

To find P(T>t), use the TDIST(t, ν, 1) function in Excel. For example, P(T>1.96) with ν = 5 degrees of freedom is 0.0536.

Notes:
1) TDIST(t, ν, 2) will provide 2·P(T>t). The reason why Excel has this option will be discussed in Chapter 10.
2) Please be careful: the function finds P(T>t), NOT P(T<t)! Again, the function finds 1−F(t), not F(t).
3) Excel will not accept a negative value of t! Instead, you need to use the symmetry of the PDF to find the probability. Thus, P(T<−1.96) = P(T>1.96).
4) Below is a drawn-in version of what has been found above.

To find t in P(T>t) = 0.05, the TINV(probability*2, ν) function can be used. For example, to find t in P(T>t) = 0.0536 with ν = 5, use TINV(0.0536*2, 5). BE VERY CAREFUL! Notice that I needed to multiply the probability value by 2. The reason why Excel has this property will be discussed in Chapter 10.

Notes:
1) We can say "1.96 is the 0.9464 quantile from a t-distribution with 5 degrees of freedom." Also, we could say "1.96 is the 94.64th percentile from a t-distribution with 5 degrees of freedom."
2) The t in P(T>t) = α is often denoted by t_{α,ν}, for ν degrees of freedom and α as the area to the right of it. Thus, P(T > t_{0.0536,5}) = 0.0536.
3) t_{α,ν} = −t_{1−α,ν}. Why?

Examine how to work with the t-distribution through the use of the spreadsheet t_prob.xls.

Table A.4 on p. 672-3 gives some of the probabilities that one could calculate for some degrees of freedom. We will not use the table in this class.
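The Excel results above can be reproduced by integrating h(t) numerically. Here is a Python sketch (function names and the trapezoid-rule helper are my own; stdlib only) mimicking TDIST(t, ν, 1):

```python
import math

def t_pdf(t, nu):
    # t-distribution PDF with nu degrees of freedom (lgamma avoids overflow
    # of the gamma function for large nu)
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(math.pi * nu)
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t, nu, upper=60.0, steps=100_000):
    # P(T > t), the analogue of Excel's TDIST(t, nu, 1), by the trapezoid
    # rule; the integral is truncated at 'upper', where the density is negligible.
    h = (upper - t) / steps
    total = 0.5 * (t_pdf(t, nu) + t_pdf(upper, nu))
    for i in range(1, steps):
        total += t_pdf(t + i * h, nu)
    return total * h

print(round(t_upper_tail(1.96, 5), 4))        # 0.0536, as from TDIST(1.96, 5, 1)
# Unlike Excel, a negative t is fine here; symmetry: P(T < -1.96) = P(T > 1.96)
print(round(1 - t_upper_tail(-1.96, 5), 4))   # 0.0536
```

Note that the second line demonstrates the symmetry used in note 3 above without needing the sign trick that TDIST forces.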
Just as a reminder, remember how one would find these probabilities using regular integration! For example,

    P(T>1.96) = \int_{1.96}^{\infty} \frac{\Gamma[(5+1)/2]}{\Gamma(5/2)\sqrt{5\pi}}\left(1 + \frac{t^2}{5}\right)^{-(5+1)/2} dt = 0.0536.

In Maple,

> f(t):=GAMMA((nu+1)/2)/(GAMMA(nu/2)*sqrt(Pi*nu)) * (1+t^2/nu)^(-(nu+1)/2);
> int(eval(f(t),nu=5),t=1.96..infinity);
        .05364397625
> 1-stats[statevalf, cdf, studentst[5]](1.96);
        .0536439763

The CLT says that Z = (X̄−μ)/(σ/√n) can be approximated by a standard normal PDF for large n (sample size). Below are some problems:
1) Typically, μ and σ will not be known.
2) What if n is small?

Below is part of the solution to these problems:

Corollary: Let X1, X2, …, Xn be independent random variables that are all normal with mean μ and standard deviation σ. Let

    \bar{X} = \frac{1}{n}\sum_{i=1}^{n}X_i  and  S^2 = \frac{\sum_{i=1}^{n}(X_i-\bar{X})^2}{n-1}.

Then the random variable

    T = \frac{\bar{X}-\mu}{S/\sqrt{n}}

has a t-distribution with ν = n−1 degrees of freedom.

Notes:
1) The σ in the CLT has been replaced with S in the corollary. This makes the statistic more realistic because S can be calculated from a sample.
2) The standard normal distribution is not being used anymore. Instead, T has a t-distribution EXACTLY, provided X1, X2, …, Xn are independent random variables with the same normal PDF. No matter what the sample size, T has the t-distribution!
3) From 2), the distributional assumptions about X1, X2, …, Xn still limit us somewhat. Remember that the CLT held NO MATTER what the PDF for X1, X2, …, Xn. However, the t-distribution still often serves as a nice approximation for the PDF of T in many situations.
4) Suppose n is very large. What happens to the PDF of T?

Example: Healthy Choice (health_choice.xls)

Healthy Choice cereal fills its boxes on average with 24 oz. of cereal, and the standard deviation is 2 oz. Suppose an FDA official wants to find out if boxes of Healthy Choice cereal have the advertised weight of 24 oz. The FDA official randomly samples 36 boxes of cereal. SUPPOSE the individual weight for each box is normally distributed with the same μ and σ, and suppose Healthy Choice is making cereal with μ = 24 oz. of cereal. In the sample of size 36, the SAMPLE STANDARD DEVIATION was s = 2 oz. of cereal. What is the probability the sample mean weight is greater than 23 oz.?

What is different here from p. 8.34?
1) Each box's weight has the same normal PDF. Last time, nothing was said about the PDFs.
2) The sample gave us a sample standard deviation of s = 2. Thus, we did not use the assumption of σ = 2 as we did last time.
3) Although n ≥ 30, meaning the CLT should work o.k., I am going to use the t-distribution here.

    P(\bar{X} > 23) = P\left(\frac{\bar{X}-24}{S/\sqrt{n}} > \frac{23-24}{2/\sqrt{36}}\right)
                    = P(T > −3)
                    = P(T < 3)        by the symmetry of the PDF
                    = 1 − P(T > 3)    needed in order to use TDIST()
                    = 1 − 0.002474
                    = 0.997526

The Excel function used was: =1-TDIST(3,35,1)

On p. 8.34 we found the probability to be 0.998650 using the CLT.
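The Healthy Choice calculation can also be mirrored outside of Excel. A Python sketch (stdlib only; the integration helper is my own construction, not part of the notes) reproducing =1-TDIST(3,35,1):

```python
import math

def t_pdf(t, nu):
    # t-distribution PDF with nu degrees of freedom
    c = math.exp(math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)) / math.sqrt(math.pi * nu)
    return c * (1 + t * t / nu) ** (-(nu + 1) / 2)

def t_upper_tail(t, nu, upper=60.0, steps=100_000):
    # P(T > t) by the trapezoid rule (analogue of TDIST(t, nu, 1))
    h = (upper - t) / steps
    total = 0.5 * (t_pdf(t, nu) + t_pdf(upper, nu))
    for i in range(1, steps):
        total += t_pdf(t + i * h, nu)
    return total * h

# Observed statistic: (23 - 24) / (2 / sqrt(36)) = -3
xbar_cutoff, mu, s, n = 23.0, 24.0, 2.0, 36
t_obs = (xbar_cutoff - mu) / (s / math.sqrt(n))

# P(Xbar > 23) = P(T > -3) = 1 - P(T > 3), with nu = n - 1 = 35
prob = 1 - t_upper_tail(3, n - 1)
print(t_obs, round(prob, 6))   # -3.0 and about 0.997526
```

This matches the hand calculation above, and swapping the t tail for a standard normal tail would recover the CLT answer of about 0.998650 from p. 8.34.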