Week 1 Lecture: The Normal Distribution (Chapter 6)

This bell-shaped curve is probably the most common and well-recognized distribution in all of biology. However, not all bell-shaped curves are normal. A normally distributed random variable X has a mean µ and a standard deviation σ; i.e., X ~ N(µ, σ).

When X ~ N(µ, σ), then Z = (X − µ)/σ ~ N(0, 1), which is called the "standard normal" curve. The Z-value gives the number of standard deviations away from µ = 0. Table B.2 gives the proportions of a normal distribution that lie beyond a specified Z-value. Because the normal distribution is symmetrical, we can use the table to find different proportions for a random variable.

For any normal distribution:
µ ± σ contains 68.3% of the population
µ ± 2σ contains 95.5% of the population
µ ± 3σ contains 99.7% of the population.

Example: X = a normal distribution of weights, with µ = 70 kg and σ = 10 kg. What if we want to find the proportion of the distribution that is greater than 80 kg, i.e., P(X > 80)?

First, we need to compute Z = (X − µ)/σ = (80 − 70)/10 = 1.0. Then, we have P(X > 80 kg) = P(Z > 1.0) = 0.1587, or 15.87%. This can also be thought of as the probability of randomly sampling a weight X greater than 80 kg from a population with µ = 70 kg and σ = 10 kg.

What if we want to determine P(X > 60 kg)? Again, calculate Z = (60 − 70)/10 = −1.0. Then, you have P(X > 60 kg) = P(Z > −1.0) = 1 − 0.1587 = 0.8413 (you can do this because the distribution is symmetrical).

You can also find the proportion between two points:
P(55 < X < 65) = P(−1.5 < Z < −0.5) = (0.5 − 0.0668) − (0.5 − 0.3085) = 0.2417.

What if we want the value of X that represents a specified percentage of the distribution? This interval can be expressed as µ ± Z·σ. Remember that the Z-value represents the number of standard deviations from a mean of zero.
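Before turning to those cutoff values, the tail-area calculations above can be checked without the table. A short sketch using Python's standard library (the distribution parameters are the lecture's; the variable names are my own):

```python
from statistics import NormalDist

# Weight distribution from the example: mu = 70 kg, sigma = 10 kg
X = NormalDist(mu=70, sigma=10)

# P(X > 80): equivalent to standardizing, Z = (80 - 70)/10 = 1.0
p_gt_80 = 1 - X.cdf(80)
print(round(p_gt_80, 4))    # 0.1587

# P(X > 60): Z = -1.0; by symmetry this is 1 - 0.1587
p_gt_60 = 1 - X.cdf(60)
print(round(p_gt_60, 4))    # 0.8413

# P(55 < X < 65): area between Z = -1.5 and Z = -0.5
p_between = X.cdf(65) - X.cdf(55)
print(round(p_between, 4))  # 0.2417
```

The `cdf` method returns the cumulative proportion below a value, so a "beyond-Z" table lookup corresponds to `1 - cdf(...)`.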
So, what is the value of X that encompasses the middle 50% (Z = ±0.67; look up the Z-value that cuts off 25% in each tail)?
Xu = 70 + 0.67*10 = 76.7 kg
Xl = 70 − 0.67*10 = 63.3 kg.

What is the value of X that cuts off the upper 5% (Z = 1.645)?
Xu = 70 + 1.645*10 = 86.5 kg

What is the value of X that cuts off the lower 10% (Z = −1.28)?
Xl = 70 − 1.28*10 = 57.2 kg

Central Limit Theorem

The CLT is concerned with a sample mean, X̄ = (Σ xi)/n. For any n, a sample mean drawn from a normal population has exactly a normal distribution with mean µ and standard deviation σ/√n; i.e., X̄ ~ N(µ, σ/√n). If the parent population is not normally distributed and n is large, then X̄ ~ N(µ, σ/√n) approximately.

Thus, for any X̄:
• E(X̄) = µ
• Var(X̄) = σ²/n
• √(σ²/n) = σ/√n = standard error of the sample mean = SE.

If X is normally distributed (X ~ N(µ, σ)), then Z = (X̄ − µ)/(σ/√n) ~ N(0, 1) for any n. If X has some other distribution, then Z = (X̄ − µ)/(σ/√n) tends toward N(0, 1) as n gets large.

Note the difference between the individual X and the sample mean X̄:
1. X ~ N(µ, σ) implies: Z = (X − µ)/σ ~ N(0, 1).
2. X̄ >>> Z = (X̄ − µ)/(σ/√n) is approximately N(0, 1) as n gets large, and is exactly N(0, 1) if X ~ N(µ, σ).

Introducing Statistical Hypothesis Testing

Statistics is largely concerned with making inferences about a population from a sample collected from that population. Most often, the inferences are made on one or more population means. Statistical inferences are constructed on a framework that includes a null hypothesis and an alternative (or research) hypothesis. The null hypothesis represents a condition of "no change," "no difference," or "equality," while the alternative hypothesis represents the condition that is true if the null hypothesis is false. The null and alternative hypotheses should be stated a priori. When testing hypotheses concerning population means, sample means are calculated from randomly sampled data collected within the population.
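The cutoff examples invert the table lookup (an X for a given proportion rather than a proportion for a given X), and the CLT's standard-error claim can be checked by simulation. A standard-library sketch (the seed and replicate count are my own illustrative choices, not from the lecture):

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev

X = NormalDist(mu=70, sigma=10)

# Inverse lookup: the X that cuts off a given cumulative proportion
print(round(X.inv_cdf(0.75), 1))  # 76.7  (upper bound of the middle 50%)
print(round(X.inv_cdf(0.25), 1))  # 63.3  (lower bound of the middle 50%)
print(round(X.inv_cdf(0.95), 1))  # 86.4  (upper 5%; the hand value 86.5 comes from rounding Z to 1.645)
print(round(X.inv_cdf(0.10), 1))  # 57.2  (cuts off the lower 10%)

# CLT check by simulation: the standard deviation of many sample means
# should be close to the standard error sigma/sqrt(n)
random.seed(1)
n = 25
means = [mean(random.gauss(70, 10) for _ in range(n)) for _ in range(5000)]
print(round(10 / sqrt(n), 2))  # 2.0, the theoretical SE
print(round(stdev(means), 2))  # close to 2.0
```

With n = 25 the sample means cluster far more tightly (SE = 2 kg) than the individual weights (σ = 10 kg), which is the practical content of the CLT.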
Then, the probability of that sample mean occurring, given that the null hypothesis is true, is calculated. This probability can be graphically represented by the proportion of the area under the curve, as we calculated earlier. This calculated probability is then compared to an objective criterion for drawing a statistical conclusion about our sample mean. This comparison is stated in a "decision rule." Thus, our criterion is used to "reject" or "not reject" the null hypothesis.

This is an important concept: the calculated probability represents the sample data given that the null hypothesis is true, and NOT that the null hypothesis is true given the data. Thus, we never "accept" the null hypothesis as being true; we only "fail to reject" or "do not reject" the null hypothesis. This is a philosophy of "falsification," largely codified by Karl Popper. The reason we can never "accept" a null hypothesis is that in statistics, we never actually have a complete accounting of a population.

In some instances, our sample mean may actually be an extreme occurrence that creates an error in the conclusion of our statistical test. There are two types of errors that occur in statistical hypothesis testing: 1) Type I Error and 2) Type II Error. A Type I Error occurs when we reject a null hypothesis that is in fact true, and a Type II Error occurs when we do not reject a null hypothesis that is in fact false. As the researcher, you set the levels of Type I and Type II Error that you are willing to accept. The Type I Error rate, also called the "significance level" or "alpha level," is often set at 0.05 (a 5% chance of rejecting a true null hypothesis), while the Type II Error rate, or "beta level" (beta = 1 − power), is often set at 0.10. These values are arbitrary and can be changed. The idea of "statistical power" will be discussed in more detail later, but in summary it tells you about the reliability of your statistical test for making a conclusion about the sample mean.
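The decision-rule logic can be made concrete with a one-sample Z-test sketch in Python. The numbers here (an observed mean of 74.2 kg from n = 25, with σ = 10 known) are hypothetical, chosen only to illustrate the mechanics:

```python
from math import sqrt
from statistics import NormalDist

Z0 = NormalDist()  # standard normal, N(0, 1)

# Hypothetical example: test Ho: mu = 70 against Ha: mu != 70 (two-tailed),
# with known sigma = 10, sample size n = 25, and observed mean 74.2
mu0, sigma, n, xbar = 70, 10, 25, 74.2
alpha = 0.05

z = (xbar - mu0) / (sigma / sqrt(n))  # test statistic: Z = (xbar - mu)/(sigma/sqrt(n))
p_value = 2 * (1 - Z0.cdf(abs(z)))    # probability of a mean this extreme under Ho

# Decision rule: reject Ho if p < alpha; otherwise "fail to reject" Ho
decision = "reject Ho" if p_value < alpha else "fail to reject Ho"
print(round(z, 2), round(p_value, 4), decision)  # 2.1 0.0357 reject Ho
```

Note the wording: with p = 0.0357 < 0.05 we reject Ho; had p exceeded alpha, we would fail to reject Ho rather than "accept" it, for the falsification reasons above.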
You typically increase the power of a test by increasing the sample size. Another point about hypotheses is that they can be directional; you can specify which tail of the distribution (or both) you are testing. This is called "one-tailed" versus "two-tailed" testing, and it is important that you correctly state your hypotheses to reflect which tail (or both) you are considering. Zar provides an excellent treatment of this topic in section 6.3; it may very well be the best section in the book, in my opinion!

Testing for Departures from Normality

The chi-square and Kolmogorov-Smirnov goodness-of-fit tests can be used to assess whether or not a sample came from a normal population, though Zar does not recommend these methods because they have low power. Graphical methods can be used to visually assess whether the sample appears to be from a normal population. Shapiro and Wilk developed a test for normality based on a "W" statistic. This is one of the normality tests provided in SAS.

Let's do an example in SAS to show how you can go about testing for normality.

Example: We want to test whether total height (in feet) in two samples of Appalachian oaks is normally distributed. We will use the PROC UNIVARIATE routine in SAS to calculate the Shapiro-Wilk "W" statistic and test the hypotheses:

Ho: The white oak sample comes from a normal distribution.
Ho: The red oak sample comes from a normal distribution.
Ha's: not Ho.
α = 0.05

From SAS, we get:
• W (white oaks) = 0.96 (p = 0.24) >>> do not reject Ho.
• W (red oaks) = 0.97 (p = 0.44) >>> do not reject Ho.