The Normal Distribution and Inferential Statistics
BIOL 2608 Biometrics

Tabular Presentation of Data

e.g. 1. The number of sparrows' nests per hectare was counted for each of 36 hectares:

1 1 0 1 3 0 0 0 1 1 0 1 0 1 4 0 1 2 2 1 2 1 1 3 1 2 1 0 2 1 1 0 0 1 1 1

e.g. 1. A frequency distribution table for the number of sparrows' nests per hectare:

No. of sparrows' nests:   0    1    2    3    4
No. of hectares:         10   18    5    2    1

Frequency Distribution

e.g. 2. The particle sizes (μm) of 37 grains from a sample of sediment from an estuary:

8.2  6.3  6.8  6.4  8.1  6.3
5.3  7.0  6.8  7.2  7.2  7.1
5.2  5.3  5.4  6.3  5.5  6.0
5.5  5.1  4.5  4.2  4.3  5.1
4.3  5.8  4.3  5.7  4.4  4.1
4.2  4.8  3.8  3.8  4.1  4.0
4.0

Define convenient classes of equal width (class interval), e.g. 1 μm.

e.g. 2. Frequency distribution for the size of particles collected from the estuary:

Particle size (μm):  3.0 to <4.0   4.0 to <5.0   5.0 to <6.0   6.0 to <7.0   7.0 to <8.0   8.0 to <9.0
Frequency:                2             12            10             7             4             2

Frequency Histogram

[Figure: frequency histogram of particle size (μm), classes 3 to <4 through 8 to <9; y-axis frequency 0 to 15.]

Stem-and-Leaf Displays

e.g. 2. A stem-and-leaf plot for the size of particles collected from the estuary:

Stem | Leaf
 3.  | 88
 4.  | 523334128100
 5.  | 3234551187
 6.  | 3843830
 7.  | 0221
 8.  | 21

This display shows the actual data as well as the general shape of the distribution, so it is useful for exploring data.

Cumulative Frequency Distribution

e.g. 3. Cases of meningitis in England and Wales (1989), by age:

Age        No. of cases   Cumulative frequency (CF)   % CF
<1              673              673                  25.18
1 to <2         354             1027                  38.42
2 to <3         193             1220                  45.64
3 to <4         129             1349                  50.47
4 to <5          79             1428                  53.42
5 to <10        204             1632                  61.05
10 to <15       144             1776                  66.44
15 to <25       345             2121                  79.35
>=25            552             2673                 100.00

e.g. 4. Frequency distribution of height of last year's students (n = 52: 30 females & 22 males). Why is the distribution bimodal-like?
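The frequency distribution and the stem-and-leaf display of e.g. 2 can be reproduced programmatically. The sketch below uses only the Python standard library; the data and the 1 μm class width are taken from the notes.

```python
from collections import Counter

# Particle sizes (μm) of 37 sediment grains (e.g. 2)
sizes = [8.2, 6.3, 6.8, 6.4, 8.1, 6.3,
         5.3, 7.0, 6.8, 7.2, 7.2, 7.1,
         5.2, 5.3, 5.4, 6.3, 5.5, 6.0,
         5.5, 5.1, 4.5, 4.2, 4.3, 5.1,
         4.3, 5.8, 4.3, 5.7, 4.4, 4.1,
         4.2, 4.8, 3.8, 3.8, 4.1, 4.0, 4.0]

# Frequency distribution with 1 μm classes: "3.0 to <4.0", "4.0 to <5.0", ...
freq = Counter(int(x) for x in sizes)   # class = integer part of the size
for lower in sorted(freq):
    print(f"{lower}.0 to <{lower + 1}.0: {freq[lower]}")

# Stem-and-leaf: stem = integer part, leaf = first decimal digit,
# leaves listed in the order the data were recorded (as in the plot above)
stems = {}
for x in sizes:
    stem, leaf = divmod(round(x * 10), 10)
    stems.setdefault(stem, []).append(str(leaf))
for stem in sorted(stems):
    print(f"{stem}. | {''.join(stems[stem])}")
```

Running this recovers exactly the table and the stem-and-leaf rows shown above.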
[Figure: frequency histogram of height (cm) for the whole class, classes >149-153 through >181-185; y-axis frequency 0 to 14.]

e.g. 4. Frequency distribution of female height (cm) of the class (n = 30).

Ideal class number ≈ 5 log10 n; e.g. 5 log10(30) ≈ 7.4,
then class width = (max − min)/7.4 = (170 − 149)/7.4 ≈ 2.8 cm.

[Figure: frequency histogram of female height (cm), classes >149-152 through >167-170; y-axis frequency 0 to 8.]

Normal Curve

f(x) = [1/(σ√(2π))] exp[−(x − µ)²/(2σ²)]

[Figures: the female-height frequency histogram drawn with two different bin sizes, each overlaid with a fitted normal curve.] Changing the bin size can modify the histogram.

The parameters µ and σ determine the position of the curve on the x-axis and its shape.

The normal curve was first expressed on paper (for astronomy) by A. de Moivre in 1733. It was not applied to environmental problems until the 1950s. (P.S. Non-parametric statistics were developed in the 20th century.)

[Figure: probability density curves of male and female height (cm), x-axis 140 to 190 cm.]

[Figure: probability density curves of N(10,1), N(20,1), N(20,2) and N(10,3), x-axis 0 to 30.]

• Normal distribution: N(µ, σ)
• Probability density function: the area under the curve is equal to 1.

The Standard Normal Curve

µ = 0, σ = 1, with the total area under the curve = 1; units along the x-axis are measured in σ units.

Figures: (a) for µ ± 1σ, area = 0.6826 (68.26%); (b) for µ ± 2σ, area = 0.9544 (95.44%); (c) the shaded area = 100% − 95.44%.

Application of the Standard Normal Distribution

For example: we have a large data set (e.g. n = 200) of normally distributed suspended-solids determinations for a particular site on a river: x̄ = 18.3 ppm and s = 8.2 ppm. We are asked to find the probability of a random sample containing 30 ppm suspended solids or more.
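The class-number rule and the normal density formula above can be checked with a few lines of standard-library Python. This is a sketch of the rule as quoted in the notes (k = 5 log10 n); other texts use different rules, e.g. Sturges'.

```python
import math

def ideal_classes(data_min, data_max, n):
    """Class number and width using the rule quoted in the notes: k = 5*log10(n)."""
    k = 5 * math.log10(n)
    return k, (data_max - data_min) / k

# Female heights from e.g. 4: min = 149 cm, max = 170 cm, n = 30
k, width = ideal_classes(149, 170, 30)
print(round(k, 1), round(width, 1))      # ≈ 7.4 classes of ≈ 2.8 cm

def normal_pdf(x, mu, sigma):
    """f(x) = [1/(sigma*sqrt(2*pi))] * exp(-(x - mu)^2 / (2*sigma^2))"""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The curve peaks at x = mu with height 1/(sigma*sqrt(2*pi)), and is symmetric about mu
print(round(normal_pdf(10, 10, 1), 4))   # ≈ 0.3989 for N(10, 1)
```

Evaluating `normal_pdf` over a grid of x values reproduces curves like the N(10,1), N(20,2), etc. shown in the figures.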
The standard deviation (or Z) value: Z = (Xi − µ)/σ

Z = (30 − 18.3)/8.2 = 1.43

Check the Z table (Table B2 in Zar's book): the probability of a sample having ≥ 30 ppm is 0.0764, or 7.64%, i.e. for n = 200, about 15 samples are expected to have ≥ 30 ppm.

Central Limit Theorem

As sample size (n) increases, the means of samples (i.e. subsets or replicate groups) drawn from a population of any distribution will approach the normal distribution.

By taking the mean of the means, we smooth out the extreme values within the sets while keeping the mean of the means close to the overall mean. As the number of subsets increases, the standard deviation of the mean of the means is reduced, and the frequency distribution comes very close to the normal distribution.

Inferential Statistics - testing the null hypothesis

Inferential = "that may be inferred"; infer = conclude or reach an opinion.

The hypothesis under test, the null hypothesis (Ho), will be that Z has been chosen at random from the population represented by the standard normal curve. Frequencies of Z values close to the mean (µ = 0) are high, while frequencies away from the mean decline.

e.g. two values of Z are shown: Z = 1.96 and Z = 2.58. From Table B2, the corresponding one-tailed probabilities are 0.025 (2.5%) and 0.0049 (≈0.5%).

As the curve is symmetrical about the mean, the p of obtaining a value of Z < −1.96 is also 2.5%, so the total p of obtaining a value of Z between −1.96 and +1.96 is 95%. Likewise, between Z = ±2.58, the total p = 99%.

Then we can state a null hypothesis that a random observation of the population will have a value between −1.96 and +1.96.

Alternatively, suppose a random observation of Z lies outside the limits −1.96 or +1.96. There are then two possibilities: either we have chosen an 'unlikely' value of Z, or our hypothesis is incorrect.
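The table look-up in the suspended-solids example can be verified numerically: for a standard normal variable, P(Z > z) = 0.5·erfc(z/√2). A minimal sketch using only the standard library (the 18.3 ppm and 8.2 ppm figures are the values from the example; Z is rounded to two decimal places, as when reading Table B2):

```python
import math

def upper_tail(z):
    """P(Z > z) for the standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

x_bar, s = 18.3, 8.2            # suspended solids (ppm) from the example
z = round((30 - x_bar) / s, 2)  # Z = (Xi - mean)/sd, read to 2 d.p. like the table
p = upper_tail(z)

print(z, round(p, 4))           # 1.43 and ≈ 0.0764 (7.64%), as in the notes
print(round(200 * p))           # expected occurrences among n = 200 samples ≈ 15
```

The same function reproduces the critical-value probabilities quoted below: `upper_tail(1.96)` ≈ 0.025 and `upper_tail(2.58)` ≈ 0.0049.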
Conventionally, when performing a significance test, we make the rule that if the Z value lies outside the range ±1.96, the null hypothesis is rejected and the Z value is termed significant at the 5% level, i.e. α = 0.05 (or p < 0.05); ±1.96 is the critical value of the statistic. For Z = ±2.58, the value is termed significant at the 1% level.

Statistical Errors in Hypothesis Testing

Consider court judgements where the accused is presumed innocent until proved guilty beyond reasonable doubt (i.e. Ho = innocent):

                               Accused is innocent    Accused is guilty
                               (Ho is true)           (Ho is false)
Court's decision: guilty       Wrong judgement        OK
Court's decision: innocent     OK                     Wrong judgement

Similar to court judgements, in testing a null hypothesis in statistics we suffer from similar kinds of errors:

                    If Ho is true    If Ho is false
If Ho is rejected   Type I error     No error
If Ho is accepted   No error         Type II error

e.g. Ho = responses of cancer patients to a new drug and a placebo are similar.

• If Ho is indeed a true statement about a statistical population, it will be concluded (erroneously) to be false 5% of the time (in case α = 0.05).
• Rejection of Ho when it is in fact true is a Type I error (also called an α error).
• If Ho is indeed false, our test may occasionally not detect this fact, and we accept Ho.
• Acceptance of Ho when it is in fact false is a Type II error (also called a β error).

Power of a Statistical Test

Power is defined as 1 − β, where β is the probability of a Type II error. Power (1 − β) is the probability of rejecting the null hypothesis when it is in fact false and should be rejected.

The probability of a Type I error is specified as α, but β is a value that we neither specify nor know. However, for a given sample size n, the β value is related inversely to the α value: a lower p of committing a Type I error is associated with a higher p of committing a Type II error.
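The α/β trade-off can be made concrete for the simplest case, a two-sided Z-test on a mean with known σ. This is an illustrative sketch, not from the notes: the effect size δ, σ = 1, and the sample sizes are arbitrary choices made for the demonstration.

```python
import math

def phi(z):
    """Standard normal CDF, Phi(z) = P(Z <= z)."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def power_two_sided_z(delta, sigma, n, z_crit=1.96):
    """Power (1 - beta) of a two-sided Z-test at alpha = 0.05 when the true
    mean differs from the null value by delta, with known sigma."""
    shift = delta * math.sqrt(n) / sigma
    # Under the alternative: P(reject Ho) = P(Z < -z_crit - shift) + P(Z > z_crit - shift)
    return phi(-z_crit - shift) + (1 - phi(z_crit - shift))

# For a fixed effect size (delta = 0.5, sigma = 1), power grows with n
for n in (4, 16, 64):
    print(n, round(power_two_sided_z(0.5, 1.0, n), 3))
```

The loop shows the point made below: with α held at 0.05, the only way to shrink β (i.e. raise power) is to increase n.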
The only way to reduce both types of error simultaneously is to increase n. For a given α, a large n will result in a statistical test with greater power (1 − β).

IMPORTANT NOTES

If n is large enough, frequency histograms of interval and ratio measurements often approximate the normal distribution. The normal distribution is a mathematical curve whose shape and location are determined by two population parameters (µ, σ). Areas beneath the standard normal curve (µ = 0, σ = 1) correspond to the probability of occurrence of normally distributed measurements with specific values. The probability of occurrence of specified measurements can be estimated using the Z table (Table B2 in Zar's book).

The central limit theorem states that the sample mean is a normally distributed quantity. Significance testing involves setting up a null hypothesis (Ho) and then providing evidence for its acceptance or rejection. If Ho is rejected on the statistical evidence, the alternative hypothesis (HA) must be accepted. In terms of probability of occurrence, Ho is rejected if its probability value < α (e.g. α = 0.05 or 5%, i.e. less than a one-in-20 chance that Ho is correct).

Significance testing may make errors:
– Rejection of Ho when it is in fact true is a Type I error (also called an α error).
– Acceptance of Ho when it is in fact false is a Type II error (also called a β error).

Power (1 − β) is the probability of rejecting the null hypothesis when it is in fact false and should be rejected. For a given n, β is inversely related to α. The only way to reduce both types of error simultaneously is to increase n. For a given α, a large n will result in a statistical test with greater power (1 − β).
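The central limit theorem summarised above can be seen directly by simulation. A sketch using only the standard library: samples are drawn from a decidedly non-normal parent population (uniform on [0, 1], whose mean is 0.5), and the mean and spread of the sample means are examined as n grows. The seed and sample counts are arbitrary choices for reproducibility.

```python
import random
import statistics

random.seed(1)

def sample_means(pop_draw, n, n_sets):
    """Means of n_sets samples, each of size n, drawn from a population."""
    return [statistics.mean(pop_draw() for _ in range(n)) for _ in range(n_sets)]

# Non-normal parent population: uniform on [0, 1], population mean = 0.5
draw = random.random

for n in (2, 10, 50):
    means = sample_means(draw, n, 2000)
    # Mean of the means stays near 0.5; their spread shrinks as n increases
    print(n, round(statistics.mean(means), 3), round(statistics.stdev(means), 3))
```

Plotting a histogram of `means` for large n gives a curve close to the normal, even though the parent distribution is flat, illustrating why the sample mean can be treated as a normally distributed quantity.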