Heredity

I am the family face;
Flesh perishes, I live on,
Projecting trait and trace
Through time to times anon,
And leaping from place to place
Over oblivion.

The years-heired feature that can
In curve and voice and eye
Despise the human span
Of durance – that is I;
The eternal thing in man,
That heeds no call to die.

Thomas Hardy

Fundamental statistics

3.1.1 Hypothesis testing

If one wishes to claim a certain explanation of how some observed data arose (e.g. that McDonald's causes obesity), this may be done by proving that a contradictory explanation is false (e.g. that McDonald's is not related to obesity). The contradictory hypothesis is called the 'Null Hypothesis' (often written H0), and the theory we wish to demonstrate is called the 'Alternate Hypothesis' (HA). If it is very unlikely that one would observe the data given that the null hypothesis were true, then we reject the null hypothesis, because a statistically significant deviation from what is expected has occurred, namely the observed data. Note that care is required when there are several alternative hypotheses. In the example, disproving the null hypothesis may not rule out the explanation that McDonald's protects against obesity.

3.1.2 Distributions

A statistic is an observed quantity or a function of observed quantities. Two statistics are said to have the same distribution if they have the same probability of producing any particular numerical result. The properties of some standard distributions, e.g. the binomial, normal and chi-squared, are well defined. If a statistic has a certain distribution, then properties that are known about that distribution may also be applied to the statistic. For example, suppose a statistic is known to have the same distribution as a χ2 (chi-squared) with 2 degrees of freedom (2 df).
If a test statistic applied to the dataset produces the result 7.3, and the probability of observing a result of 7.3 or more in a χ2 distribution with 2 df is 0.026, then the probability of the test statistic producing the same value or larger must also be 0.026.

Normal (Gaussian or error function) distribution

Variables following a normal distribution are common in the biological sciences. The distribution is written as N(µ, σ2), meaning that the data it describes have mean µ and variance σ2. Many test statistics X are designed to have the property that X ~ N(0, 1) as the sample size tends to ∞, i.e. the distribution of X becomes increasingly similar to that of the standard normal distribution (mean 0, variance 1) as the sample grows (X is asymptotically normally distributed). It follows that, in the limit, P(X > 1.64) = 0.0505, P(X > 1.96) = 0.0250, P(X > 2.33) = 0.0099 and P(X > 3.62) = 0.0001.

Binomial distribution

Gives the probability of a certain set of events from multiple repetitions of a trial that has only two outcomes. The distribution is best explained in terms of tossing coins and counting the number of heads and tails produced. The probability of getting k heads from m tosses of a coin is:

P(k heads) = m! / (k!(m-k)!) * p^k * (1-p)^(m-k)

where p is the probability of getting a head on any particular toss (if the coin is fair then p = 1/2) and m! = 1*2*...*m (note that 0! = 1). Plotting these probabilities against k gives an increasingly bell-shaped curve as the number of tosses increases.

Chi-squared (χ2) distribution

The χ2 distribution is frequently used to quantify the similarities or differences between two sets of discrete data, i.e. whether the sets of data come from the same or different distributions. There are two main applications. The first is the comparison of a set of observed results against a set of expected results. If one defines a test statistic as

X1 = Σ (Oi - Ei)^2 / Ei

where Oi are the observed values and Ei are the expected values according to some hypothesis, then X1 ~ χ2.
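The tail probabilities quoted above are easy to check numerically: for 1 and 2 df the χ2 upper tail has a simple closed form, and the standard normal tail can be written with the complementary error function. A minimal Python sketch using only the standard library (the 3:1-cross counts in the goodness-of-fit example are invented purely for illustration):

```python
import math

def normal_upper_tail(x):
    """P(Z > x) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def chi2_upper_tail(x, df):
    """P(X > x) for X ~ chi-squared; closed forms exist for df = 1 and df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(x / 2))   # = 2 * P(Z > sqrt(x))
    if df == 2:
        return math.exp(-x / 2)
    raise ValueError("closed form given only for 1 or 2 df")

# Tail probabilities quoted in the text:
print(round(normal_upper_tail(1.96), 4))   # 0.025
print(round(chi2_upper_tail(7.3, 2), 3))   # 0.026

# Goodness of fit: a hypothetical 3:1 cross with 160 offspring split 130:30,
# against expected counts 120:40 (n - 1 = 1 degree of freedom).
observed = [130, 30]
expected = [120, 40]
x1 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(round(x1, 2), round(chi2_upper_tail(x1, len(observed) - 1), 3))
```

The 1-df critical value quoted later in the text, P(X > 3.841) = 0.05, falls out of the same function.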
The statistic X1 is known as a goodness-of-fit statistic, and has n-1 degrees of freedom (df), where n is the number of categories, if no parameters are estimated from the observed data. If r parameters are estimated from the observed data then the df are n-1-r.

The second application is comparing two or more sets of observed results to investigate whether they are independent of one another. If the data are summarized in a contingency table with r rows and c columns, then a test statistic can be defined as:

X2 = Σ(i=1..r) Σ(j=1..c) (Nij - nij)^2 / nij

where Nij is the number of observations in the ith row and jth column of the table and nij is the expected number of observations in that cell. The statistic X2 follows a chi-squared distribution with (r-1)(c-1) degrees of freedom. (The critical values depend on the degrees of freedom; with 1 df, P(X > 3.841) = 0.05, etc. For more values consult distribution tables.)

3.1.3 P-values and significance

We cannot definitively prove (at least using statistical methods) that one explanation of events is true. Generally, given a particular data set, a test will produce the probability of observing results that are equal to, or more extreme than, the data, if the null hypothesis were correct. This is called the p-value. (Note that this is not the same as the probability of the null hypothesis given the observed data.)

E.g. suppose we are interested in determining whether a disease has a different prevalence in men versus women. (We assume the overall population is half men and half women.) Let p be the proportion of men in the affected population. The null hypothesis that half the affected cases are men can be expressed as:

H0: p = 1/2

The alternate hypothesis is:

HA: p ≠ 1/2

Of the eight cases observed last year from one hospital, 7 are men. These cases are unrelated and not connected in any way, so it can be assumed that they are independent. We further assume that the number of affected men in a given population has a binomial distribution.
Under the null hypothesis, the probability of observing 7 or 8 same-sex individuals (either male or female) is:

P-value = 2*(1/2)^8 + 2*C(8,7)*(1/2)^8 = 0.070

Here C(n,k) = n! / (k!(n-k)!) is the binomial coefficient, pronounced "n choose k" (it is the number of ways to choose k objects from a set of n objects). The leading factors of 2 come from the symmetry in this problem between male and female: 7 males and 1 female counts the same as 1 male and 7 females.

This p-value can be interpreted as follows. Suppose there is no gender difference in observations of the disease. If we surveyed 1000 identical hospitals with 8 cases each, then we would find that in about 70 of these hospitals, 7 or 8 of the eight cases would have the same gender.

Before applying a test, a significance level must be designated. The significance level is a cut-off: the null hypothesis is rejected if the p-value is less than the cut-off. We used a two-sided test as there was no a priori information that an abundance of women was impossible. That is, deviation from the null hypothesis could occur in either direction. Like the significance level, the statistical test should not be changed after examining the data.

Tests of the recombination fraction, θ, are one-sided since θ cannot exceed 1/2 (H0: θ = 1/2, HA: θ < 1/2). Using a one-sided test is more likely to reject the null hypothesis when the alternative hypothesis is true (i.e. one-sided tests have more power), but extreme care must be used to avoid misinterpreting the results. For example, suppose that based on the preliminary study above, we suspect that a form of the disease may be X-linked. If it is X-linked then the number of men among the affecteds should exceed the number of women. We could use a one-sided test: H0: p = 1/2, HA: p > 1/2. This time a larger sample of 100 affected cases is observed, in which 30 were men. Since the sample size is large, we use a normal approximation to the binomial to calculate the p-value.
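The two-sided p-value for the eight-case sample can be verified by direct enumeration of the binomial probabilities; a short check using only the Python standard library:

```python
from math import comb

def binom_pmf(k, n, p):
    """P(exactly k successes in n independent trials) = C(n,k) p^k (1-p)^(n-k)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# P(7 or 8 cases share a sex | H0: p = 1/2) -- i.e. 0, 1, 7 or 8 of 8 are male.
p_value = sum(binom_pmf(k, 8, 0.5) for k in (0, 1, 7, 8))
print(f"{p_value:.4f}")  # 0.0703, i.e. the 0.070 quoted above
```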
P-value = P(number of male cases ≥ 30 | N = 100, p = 1/2) > 0.999.

The null hypothesis is not rejected at a significance level of 0.05. Note that acceptance of the null for the one-sided test does not mean the proportion of men among the affected is close to 1/2 in this sample, only that there is no evidence in these data to support the X-linked hypothesis. If we had chosen a two-sided test (testing for a gender difference) then the p-value would equal the probability of observing 70 or more members of either sex under the null hypothesis of equal numbers of men and women affected. This probability is less than 0.0001.

3.1.4 Likelihood

In general use the word likelihood is a synonym for probability, but in statistics it has a more specific meaning: it is the probability (or probability density) of the observed data given the probability model that gave rise to the data. Likelihood is used to compare different candidate values for the parameters of the model, and for this purpose it needs only to be defined up to a constant of proportionality; any constant multiple of the likelihood serves equally well. When comparing two candidate values for a parameter, the one with the greater likelihood is said to be more likely, and parameter values for which the probability of the observed data is greatest are known as the most likely values, or maximum likelihood estimates (MLE).

E.g. let 10 subjects be followed for 5 years, and a record made of whether they die (fail) or survive. A simple probability model is that the outcome for each subject is independently random, with probability π of failure and 1-π of survival. The probability π is the parameter of the model. When four subjects fail and six survive, the probability of the observed data is found from the binomial distribution to be:

L(π) = 210 π^4 (1-π)^6

Suppose we wish to compare π = 0.1 with π = 0.5 as possible values for the true value which gave rise to the data.
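Such comparisons are easy to script. A minimal sketch of the likelihood function for this survival example, using only the Python standard library (the printed values match the likelihoods discussed in the next paragraph):

```python
from math import comb

def likelihood(pi, failures=4, survivors=6):
    """L(pi) = C(10, 4) * pi^4 * (1 - pi)^6 for the 10-subject example."""
    n = failures + survivors
    return comb(n, failures) * pi ** failures * (1 - pi) ** survivors

for pi in (0.1, 0.5, 0.4):
    print(pi, round(likelihood(pi), 4))   # 0.0112, 0.2051, 0.2508

# Scaling by the likelihood at the MLE (pi = 4/10) gives the scaled likelihood:
print(round(likelihood(0.5) / likelihood(0.4), 3))
```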
The two likelihoods are L(0.1) = 0.0112 and L(0.5) = 0.2051, so π = 0.5 is more likely than π = 0.1. The most likely value is π̂ = 0.4, which has likelihood 0.2508. Since the likelihood can be scaled by any constant without altering such comparisons, it is often convenient for it to take the value 1 when π takes its most likely value. The scaled likelihood for π is then the likelihood ratio L(π)/L(π̂), where π̂ is the most likely value of π. Likelihood ratios are most easily studied as differences in log likelihoods; in this example the log likelihood is l(π) = 4 log(π) + 6 log(1-π).

3.1.5 Confidence interval

Many tests produce a single numerical result with a pair of bounds in the form of a percentile confidence interval. For instance, a result may have a mean value of α with 75% confidence limits of B and C; this is interpreted as meaning that there is a probability of 0.75 that the true mean value within the population lies in the range (B, C), and that the most likely value is α.

3.1.6 Errors

In statistical testing there are two categories of errors.

Type I error: Type I errors are false positives, rejecting the null hypothesis when it is true. For instance, returning to the earlier example, even when the true numbers of men and women affected with the disease are equal, a randomly selected sample of 8 affected people will have everyone of the same sex in about 8 out of every 1000 trials. If by chance one of these samples were selected, then the null hypothesis would be rejected at a significance level of 0.01. The probability of a false positive is typically written as α. Thus in this example α = 0.008.

Type II error: Type II errors are false negatives, failing to reject the null hypothesis when it is false. For instance, even when there is a difference in the numbers of men and women affected, balanced samples with 4 affected men and 4 affected women will still be found.
As the probability that a man is affected tends towards 0 or 1, the chance of finding data this balanced tends to zero. The probability of a false negative is typically written as β.

                       The true state of nature
Our decision           H0 is true                HA is true
Accept H0              Correct decision (1-α)    Type II error (β)
Reject H0              Type I error (α)          Correct decision (1-β)

3.1.7 Power

The power of a test is the probability of rejecting the null hypothesis given that the alternative hypothesis is true; power is 1-β. The power of a test can only be defined in the context of specific circumstances. For example, it would be valid to say "the affected sib-pair method has a power of 0.76 to detect linkage between a fully penetrant recessive disease locus and a marker 20 cM distant, using a dataset of 20 fully informative sib-pairs, at a significance level of 0.04, in the absence of phenocopies due to other effects". However, omitting the nature of the disease, the marker spacing, the dataset size, the informativity or the significance level would make the sentence meaningless. The same test will have different powers when applied under different circumstances. Power comparisons (for instance "test A was shown to be 25% more powerful than test B") are invalid unless the exact situation is specified.

Whenever possible, power estimations should be performed at the outset of a study, since they will produce information on the magnitude of the effect that can be detected and the size of the datasets required. Contrasting the powers of various techniques under alternative circumstances may suggest ways of improving experiments (for instance by changing the types of families being collected for a genome search). If the disease model is not clear, then power may be calculated under a range of reasonable models.

3.1.8 Multiple testing

A typical genome search will use several hundred markers, and test each of these against one or more phenotypes.
It is inherent in the definitions of the p-value and the significance level that some false positive results will be generated. In particular, using a significance level of 1/n with m independent tests will produce on average m/n false positives. E.g. testing 600 markers at a significance level of 0.025 will result in about 15 such mistakes. A simple but naive method of combating this is to divide the original significance level by the total number of tests, so that on average the experimenter would then expect fewer than one false positive. A more elegant solution is to use the 'Bonferroni correction' (strictly, this formula is the Dunn–Šidák correction; the simple division above is the Bonferroni correction), which assumes the tests are mutually independent, and so arrives at the formula:

αi = 1 - (1 - αn)^(1/n)

where αi is the significance level for each individual test and αn is the desired overall significance level after n tests. E.g. if n = 600 and the desired overall significance level αn is 0.025, then αi is 4.22 x 10^-5, which is quite small. (In practice, tests on all 600 markers are not likely to be mutually independent, as statistics at adjacent markers will be correlated.)

With large numbers of tests attempting to locate minor perturbations in the dataset, the resulting per-test significance level may be so low that a true result is unlikely to reach the threshold. This is a general problem: for a given amount of data, attempting to reduce the number of false positives (Type I error) by using a more stringent significance level will cause a corresponding decrease in the power of the test (increased Type II error), and vice versa. In large-scale genome screens for multifactorial genes, reducing the significance level may produce unacceptable decreases in power. In such circumstances the only viable solution is to modify the overall nature of the testing, normally by accepting that false positives will be generated if the 'true' results are also to be found, and then seeking supplementary evidence to distinguish between true and false results.
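The numbers in this section can be reproduced directly; a small Python check of the expected false-positive count and the per-test significance level (standard library only):

```python
# Expected number of false positives: m independent tests, each at level alpha.
m, alpha = 600, 0.025
print(m * alpha)  # 15.0 expected false positives

# Per-test level giving overall level alpha_n across n independent tests.
n, alpha_n = 600, 0.025
alpha_i = 1 - (1 - alpha_n) ** (1 / n)
print(f"{alpha_i:.3e}")       # about 4.22e-05

# The naive division by n is very close for small alpha_n:
print(f"{alpha_n / n:.3e}")   # about 4.17e-05
```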