Heredity
I am the family face;
Flesh perishes, I live on,
Projecting trait and trace
Through time to times anon,
And leaping from place to place
Over oblivion.
The years-heired feature that can
In curve and voice and eye
Despise the human span
Of durance – that is I;
The eternal thing in man
That heeds no call to die.
Thomas Hardy
Fundamental statistics
3.1.1 Hypothesis testing
If one wishes to claim a certain explanation of how some observed data arose (e.g. that
McDonald’s causes obesity), this may be done by proving that a contradictory explanation is
false (e.g. that McDonald’s is not related to obesity). The contradictory hypothesis is called
the ‘Null Hypothesis’ (often written H0), and the theory we wish to demonstrate is called the
‘Alternate Hypothesis’ (HA). If it is very unlikely that one would observe the data if the
null hypothesis were true, then we reject the null hypothesis, because a statistically significant
deviation from what is expected has occurred, namely the observed data. Note that care is
required when there are several Alternative Hypotheses. In the example, disproving the Null
Hypothesis may not rule out the explanation that McDonald’s protects against obesity.
3.1.2 Distributions
A statistic is an observed quantity or a function of observed quantities. Two statistics are said
to have the same distribution if they have the same probability of producing any particular
numerical result. The properties of some standard distributions, e.g. the binomial, normal and
chi-squared, are well defined. If a statistic has a certain distribution, then properties that are
known about that distribution may also be applied to the statistic.
For example, suppose a statistic is known to have the same distribution as a χ² (chi-squared)
with 2 degrees of freedom (2 df). If a test statistic is applied to the dataset and produces the
result 7.3, and the probability of observing a result of 7.3 or more in a χ² distribution with
2 df is 0.026, then the probability of the test statistic producing the same value or larger must
also be 0.026.
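For 2 degrees of freedom the χ² survival function has the closed form P(X > x) = e^(−x/2), so this tail probability can be checked directly; a minimal sketch in Python:

```python
import math

# Tail probability P(X > x) for a chi-squared distribution with
# 2 degrees of freedom, which has the closed form exp(-x/2).
def chi2_2df_tail(x):
    return math.exp(-x / 2)

print(round(chi2_2df_tail(7.3), 3))  # 0.026
```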
Normal (Gaussian or error function) distribution
Variables following a normal distribution are common in the biological sciences. The
distribution is written as N(µ, σ²), meaning that the data it describes have mean value µ and
variance σ². Many test statistics X are designed to have the property that X ~ N(0,1) as the
sample size tends to ∞, i.e. the distribution of X becomes increasingly similar to that of the
standard normal distribution (mean 0, variance 1) as the sample grows (X is asymptotically
normally distributed). It follows that P(X>1.64) = 0.0505, P(X>1.96) = 0.0250, P(X>2.33) = 0.0099
and P(X>3.62) = 0.0001.
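These tail probabilities can be reproduced with the complementary error function, since for the standard normal P(X > z) = ½ erfc(z/√2); a quick check in Python:

```python
import math

# Upper-tail probability of the standard normal N(0, 1).
def normal_tail(z):
    return 0.5 * math.erfc(z / math.sqrt(2))

for z in (1.64, 1.96, 2.33, 3.62):
    print(z, round(normal_tail(z), 4))  # matches the values quoted above
```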
Binomial distribution
The binomial distribution gives the probability of a certain set of events from multiple
repetitions of a trial that has only two outcomes. The distribution can best be explained in
terms of tossing coins and counting the number of heads and tails produced. The probability
of getting ‘k’ heads from ‘m’ tosses of a coin is:
C(m,k) p^k (1−p)^(m−k), where C(m,k) = m! / (k!(m−k)!),
‘p’ is the probability of getting a head on any particular toss (if the coin is fair then p = 1/2)
and m! = 1×2×…×m (note that 0! = 1). Summing over the alternatives gives a bell-shaped
curve as the number of tosses increases.
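As an illustrative sketch, the coin-tossing probability can be evaluated directly from this formula:

```python
from math import comb

# Probability of exactly k heads in m tosses of a coin whose
# probability of a head on any one toss is p.
def binom_pmf(k, m, p):
    return comb(m, k) * p ** k * (1 - p) ** (m - k)

print(binom_pmf(3, 8, 0.5))  # 0.21875
# The probabilities over all possible k sum to 1.
print(sum(binom_pmf(k, 8, 0.5) for k in range(9)))
```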
Chi-squared (χ2) distribution
The χ² distribution is frequently used to quantify the similarities or differences between two
sets of discrete data, i.e. whether the sets of data come from the same or different distributions.
There are two main applications. The first application is in the comparison of a set of
observed results against a set of expected results. If one defines a test statistic as
X1 = Σi (Oi − Ei)² / Ei
where Oi are the observed values and Ei are the expected values according to some
hypothesis, then X1 ~ χ². The statistic X1 is known as a goodness-of-fit statistic, and has
n−1 degrees of freedom (df), where n is the number of categories, if no parameters are
estimated from the observed data. If ‘r’ parameters are estimated from the observed data
then the df are n−1−r.
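A sketch of the goodness-of-fit statistic, using invented counts (100 tosses of a supposedly fair coin giving 60 heads and 40 tails):

```python
# X1 = sum over categories of (O - E)^2 / E.
def goodness_of_fit(observed, expected):
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# 60 heads and 40 tails observed against 50/50 expected.
x1 = goodness_of_fit([60, 40], [50, 50])
print(x1)  # 4.0, to be compared against a chi-squared with 1 df
```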
The second application is comparing two or more sets of observed results to investigate
whether they are independent of one another. If the data are summarized in a contingency
table with r rows and c columns, then a test statistic can be defined as:
X2 = Σi=1..r Σj=1..c (Nij − nij)² / nij
where Nij is the number of observations in the ith row and jth column of the table and nij is
the expected number of observations in that cell. The statistic X2 follows a chi-squared
distribution with (r−1)(c−1) degrees of freedom. (The critical values depend on the degrees
of freedom; with 1 df, P(X > 3.841) = 0.05. For more values consult distribution tables.)
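A sketch of the contingency-table statistic for an invented 2×2 table; the expected count for each cell is taken from the row and column totals in the usual way (row total × column total / grand total):

```python
# Invented 2x2 contingency table of observed counts.
table = [[30, 20],
         [10, 40]]

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
grand = sum(row_totals)

# X2 = sum over cells of (N - n)^2 / n, with n the expected count.
x2 = 0.0
for i, row in enumerate(table):
    for j, n_obs in enumerate(row):
        n_exp = row_totals[i] * col_totals[j] / grand
        x2 += (n_obs - n_exp) ** 2 / n_exp

print(round(x2, 3))  # compared against chi-squared with (2-1)(2-1) = 1 df
```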
3.1.3 P-values and significance
We cannot definitively prove (at least using statistical methods) that one explanation of events
is true. Generally, given a particular data set, a test will produce the probability of observing
results that are equally, or more, extreme than the data, if the null hypothesis were correct.
This is called the p-value (note that this is not the same as the probability of the null
hypothesis given the observed data). E.g. suppose we are interested in determining whether a
disease has a different prevalence in men versus women. (We assume the overall population
is half men and half women). Let p be the proportion of men in the affected population. The
null hypothesis that half the affected cases are men can be expressed as:
H0: p=1/2
The alternate hypothesis is:
HA: p≠1/2
Of the eight cases observed last year from one hospital, 7 are men. These cases are unrelated
and not connected in any way, so it can be assumed that they are independent. We further
assume that the number of affected men in a given population has a binomial distribution.
Under the null hypothesis, the probability of observing 7 or 8 same-sex individuals (either
male or female) is:
P-value = 2·C(8,8)·(1/2)^8 + 2·C(8,7)·(1/2)^8 = 0.070
where C(n,k) = n! / (k!(n−k)!)
is the binomial coefficient, pronounced “n choose k” (it is the number of ways to choose
k objects from a set of n objects). The leading factors of 2 come from the symmetry in this
problem between male and female: 7 males and 1 female counts the same as 1 male and 7
females. This p-value can be interpreted as follows. Suppose there is no gender difference in
observations of the disease. If we surveyed 1000 identical hospitals with 8 cases each, then we
would expect that in about 70 of these hospitals, 7 or 8 of the eight cases would have the
same gender.
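The p-value above can be verified directly with the binomial coefficients:

```python
from math import comb

# Probability that 7 or 8 of the 8 cases share the same sex when
# each case is male with probability 1/2, counting both sexes.
p_value = 2 * (comb(8, 8) + comb(8, 7)) * 0.5 ** 8
print(round(p_value, 3))  # 0.07
```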
Before applying a test, a significance level must be designated. The significance level is a
cut-off: the null hypothesis is rejected if the p-value is less than the cut-off. We used a
two-sided test as there was no a priori information that an abundance of women was
impossible. That is, deviation from the null hypothesis could occur in either direction. Like
the significance level, the statistical test should not be changed after examining the data.
Tests of the recombination fraction, θ, are one-sided since θ cannot exceed ½
(H0: θ = ½, HA: θ < ½). Using a one-sided test is more likely to reject the null hypothesis
when the alternative hypothesis is true (i.e. one-sided tests have more power), but extreme
care must be used to avoid misinterpreting the results. For example, suppose that based on
the preliminary study above, we suspect that a form of the disease may be X-linked. If it is
X-linked then the number of men among the affecteds should exceed the number of women.
We could use a one-sided test:
H0: p = ½,
HA: p > ½.
This time a larger sample of 100 affected cases is observed in which 30 were men. Since the
sample size is large, we use a normal approximation to the binomial to calculate the p-value.
P-value = P(number of male cases ≥ 30 | N = 100, p=1/2) > 0.999.
The null hypothesis is not rejected at a significance level of 0.05. Note that acceptance of the
null for the one-sided test does not mean the proportion of men among the affected is close to
½ in this sample, only that there is no evidence in these data to support the X-linked
hypothesis. If we had chosen a two-sided test (testing for a gender difference) then the p-value
would equal the probability of observing 70 or more members of either sex under the null
hypothesis of equal numbers of men and women affected. This probability is less than 0.001,
so the two-sided test would reject the null hypothesis at the 0.05 level.
3.1.4 Likelihood
In general use, the word likelihood is a synonym for probability, but in statistics it has a more
specific meaning: it is the probability (or probability density) of the observed data given the
probability model that gave rise to the data. Likelihood is used to compare different possible
candidate values for the parameters of the model, and for this purpose it needs only to be
defined up to a constant of proportionality; any constant multiple of the likelihood serves
equally well. When comparing two candidate values for a parameter, the one with the greater
likelihood is said to be more likely, and parameter values for which the probability of the
observed data is greatest are known as the most likely values, or maximum likelihood
estimates (MLE).
E.g. let 10 subjects be followed for 5 years, and a record made of whether they die (fail) or
survive. A simple probability model is that the outcome for each subject is independently
random with probability π for failure and 1-π for survival. The probability π is the parameter
of the model. When four subjects fail, and six survive, the probability of the observed data is
found from the binomial distribution to be:
L(π) = 210 π^4 (1−π)^6.
Suppose we wish to compare π = 0.1 with π = 0.5 as possible values for the true value which
gave rise to the data. The two likelihoods are L(0.1) = 0.0112 and L(0.5) = 0.2051, so π = 0.5
is more likely than π = 0.1. The most likely value is π = 0.4, which has likelihood 0.2508.
Since the likelihood can be scaled by any constant without altering such comparisons, it is
often convenient to take the value 1 when π takes its most likely value. The scaled likelihood
for π is then the likelihood ratio L(π) / L(π̂), where π̂ is the most likely value for π.
Likelihood ratios are most easily studied as differences in log likelihoods; in this example the
log likelihood is l(π) = 4 log(π) + 6 log(1−π).
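The likelihood values quoted in this example can be reproduced directly from L(π) = 210 π^4 (1 − π)^6:

```python
from math import comb

# Binomial likelihood for 4 failures out of 10 subjects.
def likelihood(pi):
    return comb(10, 4) * pi ** 4 * (1 - pi) ** 6

print(round(likelihood(0.1), 4))  # 0.0112
print(round(likelihood(0.5), 4))  # 0.2051
print(round(likelihood(0.4), 4))  # 0.2508, the maximum (pi-hat = 4/10)
```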
3.1.5 Confidence interval
Many tests produce a single numerical result with a pair of bounds in the form of a percentile
confidence interval. For instance, a result may have a mean value of α with 75% confidence
limits of B and C; this is interpreted as meaning that there is a probability of 0.75 that the true
mean value within the population lies in the range (B, C), and that the most likely value is α.
3.1.6 Errors
In statistical testing there are two categories of errors:
Type I error: Type I errors are false positives, rejecting the Null hypothesis when it is true.
For instance, returning to the earlier example, even when the true numbers of men and women
affected with the disease are equal, a randomly selected sample of 8 affected people will have
every member of the same sex in about 8 out of every 1000 trials. If by chance one of these
samples were selected, then the null hypothesis would be rejected at a significance level of
0.01. The probability of a false positive is typically written as α. Thus in this example α = 0.008.
Type II error: Type II errors are false negatives, failing to reject the null hypothesis when it
is false. For instance, even when there is a difference in the numbers of men and women
affected, samples with 4 affected men and 4 affected women will still be found. As the
probability that a man is affected tends towards 0 or 1, then the chance of finding data this
balanced tends to zero. The probability of a false negative is typically written as β.
                     The true state of nature
Our decision         H0 is true           H1 is true
Accept H0            Correct decision     Type II error
Reject H0            Type I error         Correct decision
3.1.7 Power
The power of a test is the probability of rejecting the null hypothesis given that the alternative
hypothesis is true. Power is (1−β). The power of a test can only be defined in the context of
specific circumstances. For example, it would be valid to say “the affected sib-pair method has
a power of 0.76 to detect linkage between a fully penetrant recessive disease locus and a
marker 20cM distant using a dataset of 20 fully informative sib-pairs at a significance of 0.04
in the absence of phenocopies due to other effects”. However, omitting the nature of disease,
the marker spacing, data set size, informativity or significance would make the sentence
meaningless. The same test will have different powers when applied under different
circumstances. Power comparisons (for instance “test A was shown to be 25% more
powerful than test B”) are invalid unless the exact situation is specified.
Whenever possible, power estimations should be performed at the outset of a study, since they
will produce information on the magnitude of the effect that can be detected and the size of
the datasets required. Contrasting the powers of various techniques under alternate
circumstances may suggest ways of improving experiments (for instance by changing the
types of families being collected for a genome search). If the disease model is not clear, then
power may be calculated under a range of reasonable models.
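Where no convenient formula exists, power under a specified model can be estimated by simulation. The sketch below uses invented numbers, returning to the earlier one-sided test of p = ½: 100 affected cases, rejection when 59 or more are men (roughly a 0.05 significance level), and a true proportion of 0.7 under the alternative:

```python
import random

# Monte Carlo estimate of power: the fraction of simulated samples
# in which the one-sided test rejects H0: p = 1/2.
def estimate_power(n, p_true, critical, trials=10000, seed=1):
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        successes = sum(rng.random() < p_true for _ in range(n))
        if successes >= critical:
            rejections += 1
    return rejections / trials

print(estimate_power(100, 0.7, 59))  # close to 1 for this alternative
print(estimate_power(100, 0.5, 59))  # the type I error rate, about 0.05
```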
3.1.8 Multiple testing
A typical genome search will use several hundred markers, and test each of these against one
or more phenotypes. It is inherent in the definition of a p-value and significance level that
some false positive results will be generated. In particular, using a significance level of 1/n
with m independent tests will produce on average m/n false positives. E.g. testing 600
markers at a significance level of 0.025 will result in about 15 such mistakes.
A simple but naïve method of combating this is to divide the original significance level by the
total number of tests, so that on average the experimenter would then expect only one false
positive. A more elegant solution is to use the ‘Bonferroni Correction’, which assumes the
tests are mutually independent, and so arrives at the formula:
αi = 1 − (1 − αn)^(1/n)
where αi is the significance level for each individual test and αn is the overall significance
level after n tests. E.g. if n = 600 and the desired overall significance level αn is 0.025, then
αi is 4.22 × 10^−5, which is quite small (although tests on all 600 markers are not likely to be
mutually independent, as statistics at adjacent markers will be correlated). With large
numbers of tests attempting to locate minor perturbations in the dataset, the resulting
significance may be so low that a true result is unlikely to reach the threshold. This is a
general problem, in that attempting to reduce the number of false positives (Type I Error) by
using a more stringent significance level will cause a corresponding decrease in the power of a
test (increased Type II Error), and vice versa, for a given amount of data.
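The per-test significance level defined by this formula is easily computed and compared with the simple division of the significance level by the number of tests:

```python
# Per-test significance level alpha_i that gives an overall level
# alpha_n across n mutually independent tests.
def per_test_alpha(alpha_n, n):
    return 1 - (1 - alpha_n) ** (1 / n)

print(per_test_alpha(0.025, 600))  # about 4.22e-05
print(0.025 / 600)                 # about 4.17e-05, the naive division
```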
With large-scale genome screens for multifactorial genes, reducing the significance level may
produce unacceptable decreases in power. In such circumstances the only viable solution is to
modify the overall nature of the testing, normally by accepting that false positives will be
generated if the ‘true’ results are also to be found, and then seeking supplementary evidence
to distinguish between true and false results.