The normal distribution
• $\mu$ is called a location parameter because it indicates where the graph is centered or positioned.
• $\sigma$ determines the shape of $f$.

Standard normal distribution
If a distribution is normal, the percentiles of the data are known: about 68.27%, 95.45% and 99.73% of the data fall within one, two and three standard deviations of the mean, respectively.

Each different $\mu$ and $\sigma$ gives a different distribution. But if each distribution depends on its $\mu$ and $\sigma$, how can we compare different distributions? We would have to make a table for each and every possible $\mu$ and $\sigma$. To solve this problem, the normal distribution is converted to a standard normal distribution: a normal distribution with a mean of 0 and a standard deviation of 1.

Let $X$ be a normal random variable with mean $\mu$ and standard deviation $\sigma$. The transformation
$Z = \dfrac{X - \mu}{\sigma}$
expresses $X$ as the standard normal random variable $Z$ with $\mu = 0$ and $\sigma = 1$.

Example. Suppose that the scores on an aptitude test are normally distributed with a mean of 100 and a standard deviation of 10. (Some of the original IQ tests were purported to have these parameters.) What is the probability that a randomly selected score is below 90?
$Z = (90 - 100)/10 = -1.0$
Thus a score of 90 can be represented as 1 standard deviation below the mean.
$P(X < 90) = P(Z < -1.0)$
Table C.3 catalogues the CDF for the standard normal distribution from Z = -3.99 to Z = 3.99 in increments of 0.01.
$P(X < 90) = P(Z < -1.0) = 0.1587$ (from Table C.3, using the row marked -1.0 and the column marked 0.00)

Sampling distributions
The three major sampling distributions: the t distribution, the $\chi^2$ distribution (chi-square distribution), and the F distribution.

Distribution of the sample mean
When sampling from a normally distributed population with mean $\mu$ and variance $\sigma^2$, the distribution of the sample mean $\bar{X}$ (the sampling distribution) will have the following attributes:
1. The distribution of $\bar{X}$'s will be normal.
2. $\mu_{\bar{X}} = \mu$
3. $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
$\sigma_{\bar{X}}$ is the population standard error of the mean.

Example. The mean blood cholesterol concentration of a large population of adult males (50-60 years old) is 200 mg/dl with a standard deviation of 20 mg/dl. Assume that blood cholesterol measurements are normally distributed.

What is the probability that a randomly selected individual from this age group will have a blood cholesterol level below 250 mg/dl?
Solution. Apply the standard normal transformation.
$P(X < 250) = P\left(Z < \dfrac{250 - 200}{20}\right) = P(Z < 2.5) = F(2.5) = 0.9938$ (Table C.3)

What is the probability that a randomly selected individual from this age group will have a blood cholesterol level above 225 mg/dl?
Solution. Apply the standard normal transformation.
$P(X > 225) = P\left(Z > \dfrac{225 - 200}{20}\right) = P(Z > 1.25) = 1 - F(1.25) = 1 - 0.8944 = 0.1056$

What is the probability that the mean of a sample of 100 men from this age group will have a value below 204 mg/dl?
Solution.
$\mu_{\bar{X}} = 200$ mg/dl, $\sigma_{\bar{X}} = 20/\sqrt{100} = 2.0$ mg/dl
$P(\bar{X} < 204) = P\left(Z < \dfrac{204 - 200}{2.0}\right) = P(Z < 2.0) = F(2.0) = 0.9772$

If a group of 25 older men who are strict vegetarians have a mean blood cholesterol level of 188 mg/dl, would you say that vegetarianism significantly lowers blood cholesterol levels? Explain.
Solution.
$Z = \dfrac{188 - 200}{20/\sqrt{25}} = -3.0$
$P(\bar{X} < 188) = P(Z \le -3.0) = F(-3.0) = 0.0013$ (Table C.3)
A sample mean this low is very unlikely under the population distribution, so diet may affect blood cholesterol levels.

Example. Portions of prepared luncheon meats should have pH values with a mean of 5.6 and a standard deviation of 1.0. The usual quality control procedure is to randomly sample each consignment of meat by testing the pH value of 25 portions. The consignment is rejected if the sample mean pH exceeds 6.0. What is the probability of a consignment being rejected?
Solution.
$Z = \dfrac{6.0 - 5.6}{1.0/\sqrt{25}} = 2.0$
$P(\bar{X} > 6.0) = P(Z > 2.0) = 1 - F(2.0) = 1 - 0.9772 = 0.0228$
Only 2.28% of consignments will be rejected using the quality control procedure above.
Q: 5%?
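The table lookups in the worked examples above can be reproduced numerically. The following is a minimal sketch that uses Python's standard-library `statistics.NormalDist` in place of Table C.3; the helper name `prob_below` is illustrative and does not come from the slides.

```python
from math import sqrt
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
STANDARD_NORMAL = NormalDist(mu=0.0, sigma=1.0)


def prob_below(x, mu, sigma, n=1):
    """P(value < x) for one observation (n=1) or for the mean of n observations.

    Hypothetical helper: it standardizes x with the standard error
    sigma / sqrt(n) and evaluates the standard normal CDF.
    """
    standard_error = sigma / sqrt(n)
    z = (x - mu) / standard_error
    return STANDARD_NORMAL.cdf(z)


# Aptitude test: mean 100, standard deviation 10.
p_score_below_90 = prob_below(90, mu=100, sigma=10)

# Blood cholesterol: mean 200 mg/dl, standard deviation 20 mg/dl.
p_chol_below_250 = prob_below(250, mu=200, sigma=20)
p_chol_above_225 = 1 - prob_below(225, mu=200, sigma=20)
p_mean_below_204 = prob_below(204, mu=200, sigma=20, n=100)
p_mean_below_188 = prob_below(188, mu=200, sigma=20, n=25)

# Luncheon meat pH: mean 5.6, standard deviation 1.0, 25 portions per sample.
p_consignment_rejected = 1 - prob_below(6.0, mu=5.6, sigma=1.0, n=25)

for label, p in [
    ("P(X < 90)", p_score_below_90),
    ("P(X < 250)", p_chol_below_250),
    ("P(X > 225)", p_chol_above_225),
    ("P(mean < 204), n=100", p_mean_below_204),
    ("P(mean < 188), n=25", p_mean_below_188),
    ("P(mean pH > 6.0), n=25", p_consignment_rejected),
]:
    print(f"{label}: {p:.4f}")
```

Passing `n` greater than 1 is the only difference between the single-observation and sample-mean cases: the standard deviation is replaced by the standard error of the mean.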
Distribution of the sample mean
Population standard error: $\sigma_{\bar{X}} = \sigma/\sqrt{n}$
Sample standard error: $s_{\bar{X}} = s/\sqrt{n}$

Z (u) distribution and t distribution
$Z = \dfrac{\bar{X} - \mu}{\sigma_{\bar{X}}}$ is distributed as a normal distribution (Z distribution) with $\mu = 0$ and $\sigma = 1$.
$t = \dfrac{\bar{X} - \mu}{s_{\bar{X}}}$ is distributed as a t distribution with $\mu = 0$ and $\sigma$ depending on the sample size.
The t distributions are symmetric and bell-shaped like the normal distribution but a little flatter, i.e., they have a larger standard deviation. The degrees of freedom are the sample size minus 1: df = n - 1 for any t distribution.

t distribution and Student's t test
The t distribution was first presented by W. S. Gosset, who published it under the pseudonym "Student" (1908), hence the common reference to "Student's t distribution" or "Student's t test".

Example: one-tailed test for the hypotheses $H_0$: $\mu \ge 0$ and $H_A$: $\mu < 0$
The data are weight changes of humans, tabulated after administration of a drug proposed to result in weight loss. Each weight change (in kg) is the weight after minus the weight before drug administration.
0.2 -0.5 -1.3 -1.6 -0.7 0.4 -0.1 0.0 -0.6 -1.1 -1.2 -0.8
n = 12
Mean = -0.61 kg
Variance ($s^2$) = 0.4008 kg$^2$
s = 0.63 kg
$s_{\bar{X}} = s/\sqrt{n} = 0.63/\sqrt{12} \approx 0.18$ kg
$t = (\bar{X} - 0)/s_{\bar{X}} \approx -3.33$ (computed from unrounded values)
$t_{0.05(1),11} = 1.796$
If $t \le -t_{0.05(1),11}$, reject $H_0$.
Conclusion: since $-3.33 \le -1.796$, reject $H_0$ and accept $H_A$.
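The one-tailed t test on the weight-change data can be checked with a few lines of code. This is a sketch using only the standard library; the function name `one_sample_t` is an illustrative choice, and the critical value 1.796 is copied from the slide rather than computed.

```python
from math import sqrt

# Weight changes in kg (after minus before), as listed in the example.
weight_changes_kg = [0.2, -0.5, -1.3, -1.6, -0.7, 0.4,
                     -0.1, 0.0, -0.6, -1.1, -1.2, -0.8]


def one_sample_t(data, mu0=0.0):
    """Return (t statistic, degrees of freedom) for a one-sample t test."""
    n = len(data)
    mean = sum(data) / n
    # Sample variance with n - 1 in the denominator.
    variance = sum((x - mean) ** 2 for x in data) / (n - 1)
    # Sample standard error of the mean, s / sqrt(n).
    standard_error = sqrt(variance / n)
    return (mean - mu0) / standard_error, n - 1


t_stat, df = one_sample_t(weight_changes_kg)

# Critical value t_{0.05(1),11} taken from the slide, not computed here.
T_CRITICAL = 1.796

# One-tailed test of H0: mu >= 0 against HA: mu < 0.
reject_h0 = t_stat <= -T_CRITICAL

print(f"t = {t_stat:.3f}, df = {df}, reject H0: {reject_h0}")
```

Because the intermediate values are not rounded, the statistic differs slightly from a hand calculation that uses the rounded mean and standard error; the conclusion is the same.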
Chi-square distribution
The normal deviate $U$ follows a normal distribution with mean 0 and standard deviation 1. Suppose a random sample of size n is drawn from this population, with sample values $u_1, u_2, \ldots, u_n$. Then the random variable
$\chi^2 = u_1^2 + u_2^2 + \cdots + u_n^2$
follows a chi-square distribution.

When the deviations are taken from the sample mean $\bar{X}$ rather than from the population mean, the sum of squared standardized deviations becomes
$\chi^2 = \sum_{i=1}^{n} \dfrac{(x_i - \bar{X})^2}{\sigma^2} = \dfrac{(n-1)s^2}{\sigma^2}$
If all possible samples of size n are drawn from a normal population with a variance equal to $\sigma^2$, and for each of these samples the value $(n-1)s^2/\sigma^2$ is computed, these values will form a sampling distribution called a $\chi^2$ distribution with n - 1 degrees of freedom.
The Greek letter "chi", or $\chi$, is pronounced as the "ky" in "sky". The degrees of freedom for the chi-square distribution are often denoted by $\nu$ (nu, pronounced [nju:]).

Chi-square ($\chi^2$) goodness of fit
Chi-square goodness of fit is widely used to infer whether the population from which a sample of nominal data came conforms to a certain theoretical distribution.
E.g., a plant geneticist may raise 100 progeny from a cross that is hypothesized to result in a 3:1 phenotypic ratio of pink-flowered to white-flowered plants. Perhaps a ratio of 84 pink : 16 white is observed, although out of this total of 100 roses the geneticist's hypothesis would predict a ratio of 75 pink : 25 white. The question to be answered, then, is whether the observed frequencies deviate significantly from the frequencies expected if the hypothesis were true.

The following calculation of a statistic called chi-square is used as a measure of how far a sample distribution deviates from a theoretical distribution:
$\chi^2 = \sum_{i=1}^{k} \dfrac{(O_i - E_i)^2}{E_i}$
Here, $O_i$ is the frequency, or number of counts, observed in class i, $E_i$ is the frequency expected in class i if the null hypothesis is true, and the summation is performed over all k categories of data.
Larger disagreement between observed and expected frequencies will result in a larger $\chi^2$ value. Thus, this type of calculation is referred to as a measure of goodness of fit. A calculated $\chi^2$ value can be as small as zero, in the case of a perfect fit.
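As a small illustration of the goodness-of-fit statistic just described, the sketch below applies it to the pink/white flower counts. The function name `chi_square_gof` is hypothetical; it simply sums (O - E)^2 / E over the categories.

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum over classes of (O - E)^2 / E."""
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Observed counts of pink and white flowers among 100 progeny.
observed = [84, 16]
n = sum(observed)

# Expected counts under the hypothesized 3:1 ratio.
expected = [n * 3 / 4, n * 1 / 4]

chi2 = chi_square_gof(observed, expected)

# A perfect fit (observed equal to expected) gives a statistic of zero.
chi2_perfect_fit = chi_square_gof(expected, expected)

print(f"expected = {expected}, chi-square = {chi2:.3f}")
print(f"perfect fit chi-square = {chi2_perfect_fit}")
```

The expected counts are derived from the hypothesized ratio and the sample size rather than typed in, so the same function works for any number of categories.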
Example: Chi-square goodness of fit for two categories
Calculation of chi-square goodness of fit for k = 2 (e.g., data consisting of 100 flower colors compared to a hypothesized color ratio of 3:1)
$H_0$: The sample data came from a population having a 3:1 ratio of pink to white flowers
$H_A$: The sample data came from a population not having a 3:1 flower color ratio

Categories (flower color)   Pink    White   n
O_i                         84      16      100
(E_i)                       (75)    (25)

Degrees of freedom: $\nu = k - 1 = 2 - 1 = 1$
$\chi^2 = (84 - 75)^2/75 + (16 - 25)^2/25 = 4.320$
0.025 < P < 0.05. Therefore, reject $H_0$ and accept $H_A$.

Example: Chi-square goodness of fit for more than two categories
Calculation of chi-square goodness of fit for k = 4
$H_0$: The sample came from a population having a 9:3:3:1 color pattern of flowers
$H_A$: The sample came from a population not having a 9:3:3:1 color pattern of flowers

Categories (flower color): Red rayed, Red margined, Blizzard, Rayed

Expected ratio   9         3        3        1        n
O_i              152       39       53       6        250
(E_i)            (140.6)   (46.9)   (46.9)   (15.6)

$\nu = k - 1 = 4 - 1 = 3$
$\chi^2 = 8.956$
0.025 < P < 0.05. Therefore, reject $H_0$ and accept $H_A$.

Chi-square correction for continuity
Chi-square values obtained from actual data belong to a discrete (discontinuous) distribution. However, the theoretical $\chi^2$ distribution is a continuous distribution. $\chi^2$ values calculated from discrete data ($\nu = 1$ in particular) are often overestimated and may therefore cause us to commit a Type I error with a probability greater than the stated $\alpha$. The Yates correction (see below) should routinely be used when $\nu = 1$.

The log-likelihood ratio (G-test)
The $\chi^2$ test is the traditional method for tests of goodness of fit (GOF). The G-test is an alternative to the $\chi^2$ test for analyzing frequencies. The two methods are interchangeable.
The G-test is increasingly used because:
it is easier to calculate;
mathematicians believe it has theoretical advantages in advanced applications.
$G = 2 \sum O \ln(O/E)$ (ln = natural logarithm)
The G-test statistic (G) uses the same tables as the $\chi^2$ test.
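To make the continuity correction and the G statistic concrete, here is a hedged sketch applied to the two-category flower data. The Yates-corrected form used here, the sum of (|O - E| - 0.5)^2 / E, is the standard textbook version and is an assumption, since the slides do not show their own formula. The function names and the critical value 3.841 (the usual tabulated 5% point of chi-square with 1 degree of freedom) are likewise not taken from the slides.

```python
from math import log


def yates_chi_square(observed, expected):
    """Chi-square with the Yates continuity correction (intended for nu = 1).

    Assumed standard form: sum of (|O - E| - 0.5)^2 / E.
    """
    return sum((abs(o - e) - 0.5) ** 2 / e for o, e in zip(observed, expected))


def g_statistic(observed, expected):
    """Log-likelihood ratio statistic G = 2 * sum(O * ln(O / E)).

    Classes with O = 0 contribute nothing to the sum and are skipped.
    """
    return 2 * sum(o * log(o / e) for o, e in zip(observed, expected) if o > 0)


# Two-category example: 84 pink and 16 white against a 3:1 expectation.
observed = [84, 16]
expected = [75.0, 25.0]

chi2_corrected = yates_chi_square(observed, expected)
g_value = g_statistic(observed, expected)

# Tabulated 5% critical value of chi-square with 1 degree of freedom
# (standard table value, not stated in the slides).
CHI2_CRITICAL_05_DF1 = 3.841

print(f"Yates-corrected chi-square = {chi2_corrected:.3f}")
print(f"G = {g_value:.3f}")
print(f"corrected statistic exceeds 3.841: {chi2_corrected > CHI2_CRITICAL_05_DF1}")
```

For these counts the correction pulls the statistic down from 4.320 toward the critical value, which shows why the uncorrected value can overstate significance when there is only one degree of freedom.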
The G-test is based on the principle that the ratio of two probabilities can be used as a test statistic to measure the degree of agreement between sampled and expected frequencies.
Williams (1976) recommends that G be used in preference to $\chi^2$ whenever any $|O_i - E_i| > E_i$.
The two methods often yield the same conclusions; when they do not, many statisticians prefer the G-test and therefore recommend its routine use.

Example: G-test for more than two categories
$H_0$: The sample came from a population having a 9:3:3:1 color pattern of flowers
$H_A$: The sample came from a population not having a 9:3:3:1 color pattern of flowers

Categories (flower color): Red rayed, Red margined, Blizzard, Rayed

Expected ratio   9         3        3        1        n
O_i              152       39       53       6        250
(E_i)            (140.6)   (46.9)   (46.9)   (15.6)

$\nu = k - 1 = 4 - 1 = 3$
$G = 2 \sum O \ln(O/E) = 10.807$
0.010 < P < 0.025. Therefore, reject $H_0$ and accept $H_A$.

F distribution
In probability theory and statistics, the F-distribution is a continuous probability distribution that is widely used in likelihood-ratio tests, especially in ANOVA.
A random variable with an F-distribution is the ratio of two chi-square distributed variables, each divided by its degrees of freedom:
$F = \dfrac{U_1/d_1}{U_2/d_2}$
$U_1$ and $U_2$ follow chi-square distributions with $d_1$ and $d_2$ degrees of freedom, respectively.
$U_1$ and $U_2$ are independent.
In terms of sample variances:
$F = \dfrac{S_1^2/\sigma_1^2}{S_2^2/\sigma_2^2}$
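The variance-ratio form of F can be explored by simulation. The sketch below repeatedly draws two independent normal samples and computes the ratio of their scaled sample variances. The sample sizes, population parameters, seed and number of repetitions are arbitrary choices made for illustration, and the comparison value d2/(d2 - 2) is the known mean of an F distribution with d2 > 2 denominator degrees of freedom, a fact not stated in the slides.

```python
import random


def sample_variance(values):
    """Sample variance with n - 1 in the denominator."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def f_ratio(sample1, sigma1, sample2, sigma2):
    """F = (S1^2 / sigma1^2) / (S2^2 / sigma2^2) for two independent samples."""
    scaled1 = sample_variance(sample1) / sigma1 ** 2
    scaled2 = sample_variance(sample2) / sigma2 ** 2
    return scaled1 / scaled2


# Illustrative settings (assumptions, not from the slides).
rng = random.Random(0)
n1, n2 = 11, 21            # degrees of freedom d1 = 10, d2 = 20
sigma1, sigma2 = 2.0, 5.0  # population standard deviations
repetitions = 20000

ratios = []
for _ in range(repetitions):
    sample1 = [rng.gauss(0.0, sigma1) for _ in range(n1)]
    sample2 = [rng.gauss(10.0, sigma2) for _ in range(n2)]
    ratios.append(f_ratio(sample1, sigma1, sample2, sigma2))

mean_ratio = sum(ratios) / len(ratios)

# Mean of an F distribution with d2 > 2 denominator degrees of freedom.
d2 = n2 - 1
theoretical_mean = d2 / (d2 - 2)

print(f"simulated mean of F = {mean_ratio:.3f}")
print(f"theoretical mean    = {theoretical_mean:.3f}")
```

Dividing each sample variance by its own population variance removes the dependence on $\sigma_1$ and $\sigma_2$, which is why the two populations in the sketch can have different means and spreads.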