Statistical Bioinformatics

02/17/03
Chapter 5
Statistical Inference
The artful combination of the mathematical theory of probability with empirical observations
and reasoning from various sciences has given rise to a logic and methodology for drawing
conclusions in the presence of errors, variation, and uncertainty, which is called statistical
inference. Statistics, as we now understand the term, came to be recognized as a separate
field only in the twentieth century. It is the intellectual product of many minds from highly
diverse backgrounds in astronomy, biology, economics, mathematics, physics, and
psychology. In particular, the studies of heredity in the 1800s provided much of the
impetus for formalizing the concepts of inferential statistics that we use today.
The large volume of nucleic acid and amino acid sequence data contains sequencing errors,
variation across species, and uncertainty from the yet unsequenced parts of genomes. The
methodology of statistical inference is the guide we must rely on when we try to extract
information from such data. This chapter introduces the basic techniques of statistical inference,
which fall into two large categories: parameter estimation and hypothesis testing.
5.1
Sampling, Estimation and Hypothesis Testing
Suppose we are interested in finding the average percentage of strong bases (i.e., G/C) in all
hemoglobin mRNAs that have been sequenced to date. If we go to GenBank and look for
hemoglobin mRNA sequences, Entrez returns over ten thousand records. Although not
impossible, it is relatively time consuming to deal with such a large number of sequences. A
reasonable and much easier thing to do is to sample, say, just 30 of those hemoglobin mRNA
sequences, take the average of the G/C percentages in the 30 sampled sequences, and use the
sample average to estimate the average G/C percentage of the population of hemoglobin
mRNAs.
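The sampling estimate described above can be sketched as follows. This is an illustrative simulation only: the "population" of sequences is synthetic, since the real hemoglobin mRNA records would have to be fetched from GenBank.

```python
# Sketch of the sampling estimate described above. The population here is
# synthetic (random A/C/G/T strings), a stand-in for real GenBank records.
import random

random.seed(1)

# Stand-in "population" of 10952 sequences, matching the count cited later.
population = [
    "".join(random.choice("ACGT") for _ in range(300))
    for _ in range(10952)
]

def gc_percentage(seq):
    """Fraction of strong bases (G or C) in a sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

# Draw a random sample of 30 sequences and average their G/C percentages.
sample = random.sample(population, 30)
estimate = sum(gc_percentage(s) for s in sample) / len(sample)

# The true population value, for comparison. In practice this is unknowable
# without processing all 10952 sequences -- which is the point of sampling.
truth = sum(gc_percentage(s) for s in population) / len(population)
print(round(estimate, 3), round(truth, 3))
```

With a sample of 30, the estimate typically lands within a few percentage points of the true population value.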
The above example illustrates a typical case of statistical inference, the technique for
drawing conclusions about certain characteristics of a population based on information obtained
from a sample. Usually, a population is understood to be a large collection of individuals, such
as the population of residents of New York City, the population of all students at a university, or
the population of all hemoglobin mRNAs in GenBank, while a sample refers to a subcollection
(usually much smaller in size) of individuals from the population. For various purposes, we may
be interested in certain characteristics of the population, such as the average height of New
Yorkers, the average number of credit hours taken by a student in a particular semester, or the
average G/C percentage of the hemoglobin mRNAs. To obtain complete and exact knowledge of
the characteristic of interest, one would need information from every individual in the
population. When this is impractical or inconvenient, the techniques of statistical
inference provide a way to systematically use information from the sample to make
estimates, judgements, and decisions about the population.
A numerical characteristic of a population, such as an average or a proportion, is called a
population parameter, while a quantity computed from sample data is called a sample statistic.
By convention, population parameters are denoted by Greek letters (e.g., µ for the population
mean and σ for the population standard deviation), and the sample statistics that estimate them
by Roman letters.
There is much randomness involved in the process of sampling, because it is uncertain which 30
mRNA sequences end up in our sample. If each member of your group independently
samples 30 sequences, you will come up with quite different collections, and hence each of your
estimates will be different. This says that the sample average G/C percentage is
actually a random variable. In fact, any quantity that we derive from our sample data, called a
statistic, is a random variable. We are going to use the value of this random variable as an
estimate of the unknown G/C percentage of the entire population of hemoglobin mRNAs in
GenBank.
So, what do we have to do to get this estimate? First, we need to generate a random sample of 30
hemoglobin mRNA sequences. Let N denote the total number of hemoglobin mRNA
sequences in GenBank (as of 2/21/00, N = 10952). To ensure that we are sampling randomly,
without being biased by our personal preferences and practices, we can ask S-Plus to produce a
random sample of 30 numbers between 1 and N, and then select the 30 sequences with those
numbers. We can then compute the G/C percentage in each of these sequences and take the
average of the 30 percentages. Note that due to the randomness of sampling, the G/C
percentage we obtain is a random variable. We refer to it as the estimator
random variable for the unknown population G/C percentage.
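The index-drawing step the text delegates to S-Plus looks like this in Python (used here as a stand-in for S-Plus): draw 30 distinct record numbers uniformly at random from 1 to N.

```python
# Draw 30 record numbers uniformly at random, without replacement, from
# 1..N, as the text asks S-Plus to do. Python stands in for S-Plus here.
import random

N = 10952                                     # hemoglobin mRNA records as of 2/21/00
random.seed(0)
indices = random.sample(range(1, N + 1), 30)  # 30 distinct integers in 1..N
print(sorted(indices))
```

Sampling without replacement guarantees 30 distinct sequences; the selected records would then be retrieved and their G/C percentages averaged.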
Now that we have one estimator (i.e., a method of estimation) for the unknown
population percentage, we need to assess how good it is. We know that it is
unreasonable to hope to get the exact true population percentage from a sample.
However, we would be concerned if the estimator gave values that are systematically
lower or higher than the true value. An estimator is called unbiased if its
expected value is equal to the population parameter being estimated. We also need to consider
the precision of the estimator: does the estimator random variable have a lot of variation
in it? We would like our estimator to have small variance; the smaller the variance, the more
precise the estimator. Generally, the variance of an estimator decreases as the sample size
increases. If two estimators are both unbiased but have different variances at the same
sample size, the one with the smaller variance is preferred.
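Both properties, unbiasedness and variance shrinking with sample size, can be seen in a small simulation. The population of G/C fractions below is synthetic and purely illustrative.

```python
# Simulate the sample-mean estimator at several sample sizes to see that
# it is unbiased and that its variance shrinks as n grows. The population
# of G/C fractions is synthetic (illustrative values only).
import random
import statistics

random.seed(3)
population = [random.gauss(0.52, 0.04) for _ in range(20000)]
mu = statistics.mean(population)

def mean_and_spread(n, reps=2000):
    """Average and standard deviation of the sample-mean estimator at size n."""
    estimates = [statistics.mean(random.sample(population, n))
                 for _ in range(reps)]
    return statistics.mean(estimates), statistics.stdev(estimates)

for n in (10, 30, 100):
    avg, spread = mean_and_spread(n)
    print(n, round(avg, 3), round(spread, 4))
# The average of the estimates stays near mu at every n (unbiasedness),
# while the spread shrinks roughly like 1/sqrt(n) as n grows.
```

The average of the repeated estimates stays near the population mean at every sample size, while their spread falls as n increases, which is exactly the precision argument made above.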
The other main category of statistical inference is hypothesis testing. A hypothesis is any
statement that something is true. Usually, a statistical hypothesis is a statement about population
parameters, such as the average G/C percentage in all hemoglobin mRNA sequences. The
statement that the percentages of strong and weak nucleotides in hemoglobin mRNA are equal
translates to the hypothesis that µ = 0.5. We can test this hypothesis formally by a
hypothesis test, which is carried out in three steps.
1. Write down the hypotheses. For our example, the two-sided hypotheses are
   H0: µ = 0.5
   H1: µ ≠ 0.5
2. Figure out a test statistic: a random variable that is a function of quantities that
   can be obtained from the sample data, such as the sample mean, sample standard deviation,
   and sample size. The probability distribution of the test statistic needs to be known (the
   normal, Student's t, and chi-square distributions are the familiar ones). For our example,
   the test statistic is …, which has approximately a standard normal distribution.
3. Find the probability that the test statistic random variable takes a value as extreme as the
   value calculated in Step 2, given that the null hypothesis is true. In our example, the test
   statistic has approximately a standard normal distribution. If the population average is
   indeed equal to 0.5, what is the probability of getting a sample mean as different from 0.5
   as the one we obtained? This probability, called the p-value, can be looked up in a z-table
   or obtained from statistical software. If the p-value is too small, obtaining such an
   observation is deemed unlikely under the null hypothesis, and hence we reject the null
   hypothesis. How small must the p-value be before it is considered too small? In biological
   experiments, a p-value is conventionally considered too small when it is less than 0.05. In
   molecular sequence analysis, we may use a much more stringent criterion.
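The three steps can be sketched as a one-sample z-test of H0: µ = 0.5. The 30 "observed" G/C fractions below are made up for illustration, and the z statistic used is the conventional one-sample choice, (x̄ − 0.5)/(s/√n), since the chapter leaves the formula unspecified.

```python
# One-sample z-test of H0: mu = 0.5 vs H1: mu != 0.5, following the three
# steps above. The 30 G/C fractions are fabricated for illustration only;
# math.erf gives the standard normal CDF without needing SciPy.
import math

data = [0.52, 0.49, 0.55, 0.51, 0.48, 0.53, 0.50, 0.54, 0.52, 0.51,
        0.49, 0.53, 0.52, 0.50, 0.55, 0.51, 0.48, 0.54, 0.52, 0.53,
        0.50, 0.51, 0.49, 0.55, 0.52, 0.53, 0.51, 0.50, 0.54, 0.52]

n = len(data)
xbar = sum(data) / n
s = math.sqrt(sum((x - xbar) ** 2 for x in data) / (n - 1))

# Step 2: the test statistic, approximately standard normal under H0.
z = (xbar - 0.5) / (s / math.sqrt(n))

# Step 3: two-sided p-value from the standard normal distribution.
def normal_cdf(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

p_value = 2 * (1 - normal_cdf(abs(z)))
print(round(z, 2), round(p_value, 5))

# Decision: reject H0 at the 0.05 level when p_value < 0.05.
```

For these made-up data the sample mean sits above 0.5 by several standard errors, so the p-value comes out well below 0.05 and H0 would be rejected.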
We test whether such a hypothesis should be rejected by setting it up as the null
hypothesis and testing it against an alternative hypothesis that states the exact opposite of the
null. Based on the stipulated base probabilities, we should be able to figure out the probability
of observing what we observe in the data. If that chance is too small, then we say that the
null hypothesis is not believable, and hence must be rejected. The probability of observing
something as different from the prediction of the null hypothesis as what was actually observed
is called the p-value of the hypothesis test.
5.2
The Bayesian Framework
Statisticians follow a few general principles when trying to find estimates: unbiasedness,
small variance, maximum likelihood, the Bayesian approach, and the bootstrap.
5.3
Inference on molecular sequence models