Statistical Bioinformatics

02/17/03
Chapter 5
Statistical Inference
The artful combination of the mathematical theory of probability with empirical observations
and reasoning from various sciences has given rise to a logic and methodology for drawing
conclusions in the presence of errors, variation, and uncertainty, which is called statistical
inference. Statistics, as we now understand the term, came to be recognized as a separate
field only in the twentieth century. It is the intellectual product of many minds from highly
diverse backgrounds in astronomy, biology, economics, mathematics, physics, and
psychology. In particular, the studies of heredity in the 1800s provided much of the
impetus for formalizing the concepts of inferential statistics that we use today.
The large volume of nucleic acid and amino acid sequence data contains sequencing errors,
variation across species, and uncertainty from the yet unsequenced parts of genomes. The
methodology of statistical inference is the guide we must rely on when we try to extract
information from such data. This chapter introduces the basic techniques of statistical inference,
which fall into two large categories: parameter estimation and hypothesis testing.
5.1
Sampling, Estimation and Hypothesis Testing
Suppose we are interested in finding the average percentage of strong bases (i.e., G/C) in all
hemoglobin mRNAs that have been sequenced to date. If we go to GenBank and look for
hemoglobin mRNA sequences, Entrez returns over ten thousand records. Although not
impossible, it is relatively time consuming to deal with such a large number of sequences. A
reasonable and much easier thing to do is to sample, say, just 30 of those hemoglobin mRNA
sequences, take the average of the G/C percentages in the 30 sampled sequences, and use the
sample average to estimate the average G/C percentage of the population of hemoglobin
mRNAs.
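The sampling estimate described above can be sketched as follows. This is an illustrative simulation only: the "population" of sequences is synthetic, since the real hemoglobin mRNA records would have to be fetched from GenBank.

```python
# Sketch of the sampling estimate described above. The population here is
# synthetic (random A/C/G/T strings), a stand-in for real GenBank records.
import random

random.seed(1)

# Stand-in "population" of 10952 sequences, matching the count cited later.
population = [
    "".join(random.choice("ACGT") for _ in range(300))
    for _ in range(10952)
]

def gc_percentage(seq):
    """Fraction of strong bases (G or C) in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Draw a random sample of 30 sequences and average their G/C percentages.
sample = random.sample(population, 30)
estimate = sum(gc_percentage(s) for s in sample) / len(sample)

# The true population value, for comparison. In practice this is unknowable
# without processing all 10952 sequences -- which is the point of sampling.
truth = sum(gc_percentage(s) for s in population) / len(population)
print(round(estimate, 3), round(truth, 3))
```

With a sample of 30, the estimate typically lands within a few percentage points of the true population value.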
The above example illustrates a typical case of statistical inference, the technique for
drawing conclusions about certain characteristics of a population based on information obtained
from a sample. Usually, a population is understood to be a large collection of individuals, such
as the population of residents of New York City, the population of all students at a university, or
the population of all hemoglobin mRNAs in GenBank, while a sample refers to a subcollection
(usually much smaller in size) of individuals from the population. For various purposes, we may
be interested in certain characteristics of the population, such as the average height of New
Yorkers, the average number of credit hours taken by a student in a particular semester, or the
average G/C percentage of the hemoglobin mRNAs. To obtain complete and exact knowledge of
the characteristic of interest, one would need information from every individual in the
population. When this is impractical or inconvenient, the techniques of statistical
inference provide a way to systematically use information from the sample to make
estimates, judgements, and decisions about the population.
A numerical characteristic of a population, such as an average or a proportion, is called a
population parameter, while a quantity computed from sample data is called a sample statistic.
By convention, population parameters are denoted by Greek letters (e.g., µ for the population
mean and σ for the population standard deviation), and the sample statistics that estimate them
by Roman letters.
There is much randomness involved in the process of sampling, because it is uncertain which 30
mRNA sequences end up in our sample. If each member of your group independently
samples 30 sequences, you will come up with quite different collections, and hence each of your
estimates will be different. This says that the sample average G/C percentage is
actually a random variable. In fact, any quantity that we derive from our sample data, called a
statistic, is a random variable. We are going to use the value of this random variable as an
estimate of the unknown G/C percentage of the entire population of hemoglobin mRNAs in
GenBank.
So, what do we have to do to get this estimate? First, we need to generate a random sample of 30
hemoglobin mRNA sequences. Let N denote the total number of hemoglobin mRNA
sequences in GenBank (as of 2/21/00, N = 10952). To ensure that we are sampling randomly,
without being biased by our personal preferences and practices, we can ask S-Plus to produce a
random sample of 30 numbers between 1 and N, and then select the 30 sequences with those
numbers. We can then compute the G/C percentage in each of these sequences and take the
average of the 30 percentages. Note that due to the randomness of sampling, the G/C
percentage we obtain is a random variable. We refer to it as the estimator
random variable for the unknown population G/C percentage.
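The index-drawing step the text delegates to S-Plus looks like this in Python (used here as a stand-in for S-Plus): draw 30 distinct record numbers uniformly at random from 1 to N.

```python
# Draw 30 record numbers uniformly at random, without replacement, from
# 1..N, as the text asks S-Plus to do. Python stands in for S-Plus here.
import random

N = 10952                                     # hemoglobin mRNA records as of 2/21/00
random.seed(0)
indices = random.sample(range(1, N + 1), 30)  # 30 distinct integers in 1..N
print(sorted(indices))
```

Sampling without replacement guarantees 30 distinct sequences; the selected records would then be retrieved and their G/C percentages averaged.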
Now that we have one estimator (i.e., a method of estimation) for the unknown
population percentage, we need to assess how good it is. We know that it is
unreasonable to hope to get the exact true population percentage from a sample.
However, we would be concerned if the estimator gave values that are systematically
lower or higher than the true value. An estimator is called unbiased if its
expected value is equal to the population parameter being estimated. We also need to consider
the precision of the estimator: does the estimator random variable have a lot of variation
in it? We would like our estimator to have small variance; the smaller the variance, the more
precise the estimator. Generally, the variance of an estimator decreases as the sample size
increases. If two estimators are both unbiased but have different variances at the same
sample size, the one with the smaller variance is preferred.
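Both properties, unbiasedness and variance shrinking with sample size, can be seen in a small simulation. The population of G/C fractions below is synthetic and purely illustrative.

```python
# Simulate the sample-mean estimator at several sample sizes to see that
# it is unbiased and that its variance shrinks as n grows. The population
# of G/C fractions is synthetic (illustrative values only).
import random
import statistics

random.seed(3)
population = [random.gauss(0.52, 0.04) for _ in range(20000)]
mu = statistics.mean(population)

def mean_and_spread(n, reps=2000):
    """Average and standard deviation of the sample-mean estimator at size n."""
    estimates = [statistics.mean(random.sample(population, n))
                 for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

for n in (10, 30, 100):
    avg, spread = mean_and_spread(n)
    print(n, round(avg, 3), round(spread, 4))
# The average of the estimates stays near mu at every n (unbiasedness),
# while the spread shrinks roughly like 1/sqrt(n) as n grows.
```

The average of the repeated estimates stays near the population mean at every sample size, while their spread falls as n increases, which is exactly the precision argument made above.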
The other main category of statistical inference is hypothesis testing. A hypothesis is any
statement that something is true. Usually, a statistical hypothesis is a statement about population
parameters, such as the average G/C percentage in all hemoglobin mRNA sequences. The
statement that the percentages of strong and weak nucleotides in hemoglobin mRNA are equal
translates to the hypothesis that µ = 0.5. We can test this hypothesis formally by a
hypothesis test, which is carried out in three steps.
1. Write down the hypotheses. For our example, the two-sided hypotheses are
   H0: µ = 0.5
   H1: µ ≠ 0.5
2. Figure out a test statistic: a random variable that is a function of quantities that
   can be obtained from the sample data, such as the sample mean, sample standard deviation,
   and sample size. The probability distribution of the test statistic needs to be known (the
   normal, Student's t, and chi-square distributions are the familiar ones). For our example,
   the test statistic is …, which has approximately a standard normal distribution.
3. Find the probability that the test statistic random variable takes a value as extreme as the
   value calculated in Step 2, given that the null hypothesis is true. In our example, the test
   statistic has approximately a standard normal distribution. If the population average is
   indeed equal to 0.5, what is the probability of getting a sample mean as different from 0.5
   as the one we obtained? This probability, called the p-value, can be looked up in a z-table
   or obtained from statistical software. If the p-value is too small, obtaining such an
   observation is deemed unlikely under the null hypothesis, and hence we reject the null
   hypothesis. How small must the p-value be before it is considered too small? In biological
   experiments, a p-value is conventionally considered too small when it is less than 0.05. In
   molecular sequence analysis, we may use a much more stringent criterion.
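The three steps can be sketched as a one-sample z-test of H0: µ = 0.5. The 30 "observed" G/C fractions below are made up for illustration, and the z statistic used is the conventional one-sample choice, (x̄ − 0.5)/(s/√n), since the chapter leaves the formula unspecified.

```python
# One-sample z-test of H0: mu = 0.5 vs H1: mu != 0.5, following the three
# steps above. The 30 G/C fractions are fabricated for illustration only;
# math.erf gives the standard normal CDF without needing SciPy.
import math

data = [0.52, 0.49, 0.55, 0.51, 0.48, 0.53, 0.50, 0.54, 0.52, 0.51,
        0.49, 0.53, 0.52, 0.50, 0.55, 0.51, 0.48, 0.54, 0.52, 0.53,
        0.50, 0.51, 0.49, 0.55, 0.52, 0.53, 0.51, 0.50, 0.54, 0.52]

n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# Step 2: the test statistic, approximately standard normal under H0.
z = (xbar - 0.5) / (s / math.sqrt(n))

# Step 3: two-sided p-value from the standard normal distribution.
def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(z, 2), round(p_value, 5))

# Decision: reject H0 at the 0.05 level when p_value < 0.05.
```

For these made-up data the sample mean sits above 0.5 by several standard errors, so the p-value comes out well below 0.05 and H0 would be rejected.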
We test whether such a hypothesis should be rejected by setting it up as the null
hypothesis and testing it against an alternative hypothesis that states the exact opposite of the
null. Based on the stipulated base probabilities, we should be able to figure out the probability
of observing what we observe in the data. If that chance is too small, then we say that the
null hypothesis is not believable, and hence must be rejected. The probability of observing
something as different from the prediction of the null hypothesis as what was actually observed
is called the p-value of the hypothesis test.
5.2
The Bayesian Framework
Statisticians follow a few general principles when trying to find estimates: unbiasedness,
small variance, maximum likelihood, the Bayesian approach, and the bootstrap.
5.3
Inference on molecular sequence models