Biostatistics
• Statistics refers to the analysis and interpretation of data with a view toward objective evaluation of the reliability of the conclusions based on the data.
• Statistics applied to biological problems is called biostatistics or biometry.

Applications of Statistics in Bioinformatics
• Descriptive summaries
• Clinical diagnosis
• Equipment calibration
• Experimental data analysis
• Gene expression prediction
• Gene hunting
• Gene prediction
• Genetic linkage analysis
• Laboratory automation
• Nucleotide alignment
• Population studies
• Protein function prediction
• Protein structure prediction
• Quantifying uncertainty
• Quality control
• Sequence similarity

Basic Concepts - I
• Data
  – Any kind of numbers
  – Statistical analyses need numbers
• Statistics
  – Concerned with collection, organization, and analysis of data
  – Drawing inferences about a population when only a sample of the population is studied
• Summary
  – Data are numbers, numbers contain information, and statistics investigates and evaluates the nature and meaning of this information

Basic Concepts - II
• Sources of Data
  – Routinely kept records
  – Surveys
  – Experiments / Research Studies
• Biostatistics
  – Statistics applied to biological sciences and medicine
  – Statistics including not only analytic techniques but also study design issues

Variables
• Variable: A characteristic that differs from one biological entity to another.
• Continuous variable: A variable for which there is a possible value between any other two possible values. Eg: Height.
• Discrete variable: A variable that can take only certain values. Eg: Number of leaves.

Accuracy & Precision
• Accuracy is the nearness of a measurement to the actual value of the variable being measured.
• Precision refers to the closeness to each other of repeated measurements of the same quantity.

Frequency table & Frequency Distributions
• Frequency table: A listing of all the observed values of the variable being studied and how many times each value is observed. Helps summarize large amounts of data.
• Frequency distribution: The distribution of the total number of observations among the various categories is called a frequency distribution.
• Represented graphically as a bar graph, histogram, frequency polygon, etc.

Population and Samples
• Population: The entire collection of measurements about which one wishes to draw conclusions is the population / universe.
• Sample: A subset of the measurements in the population is called a sample.
• Random sampling: The selection of any member of the population in no way influences the selection of any other member, i.e., each member of the population has an equal and independent chance of being selected.

Randomness
• Data are inherently noisy and randomness is inherent in any sampling process.
• Every measurement system introduces noise (random variability) into the desired signal.
• The noise can be minimized by controlling the external environment or, more often, by reducing the bandwidth of the system using statistical techniques.
• By reducing the bandwidth of acceptable (good) data, it can be more readily differentiated from bad data and made more apparent and available. Eg: Analysis of intra-array spot fluorescence intensity can be used to control for contamination and other sources of variability.
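
The random sampling definition above can be illustrated with a short sketch. This is only a minimal example using Python's standard `random` module; the helper name `draw_sample`, the population values and the seed are assumptions made for illustration and are not part of the original slides.

```python
import random


def draw_sample(population, n, replace=False, seed=None):
    """Draw n measurements from a population (hypothetical helper)."""
    rng = random.Random(seed)  # seeded generator so the draw can be repeated
    if replace:
        # With replacement: the same member may be selected more than once.
        return rng.choices(population, k=n)
    # Without replacement: each member can be selected at most once, and
    # every possible sample of size n is equally likely.
    return rng.sample(population, n)


# Assumed population: heights (cm) of 10 plants, all values distinct.
population = [12.1, 13.4, 11.8, 14.0, 12.9, 13.1, 12.5, 13.8, 11.9, 12.7]

sample = draw_sample(population, n=4, seed=1)
print(sample)
```

Fixing the seed makes the draw reproducible, which is convenient when checking an analysis; leaving `seed=None` gives a different sample on each run.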
Simple Random Sample
• Reason
  – Sample a ‘small’ number of subjects from a population to make inferences about the population
  – Essence of statistical inference
• Definition
  – A sample of size n drawn from a population of size N in such a way that every possible sample of size n has the same chance of being selected
• Sampling with and without replacement
  – In biostatistics, most sampling is done without replacement

Interface Noise
• Much of bioinformatics work involves interfacing mechanical, biological and electronic systems, and each interface introduces noise and variability in the overall process. Eg: Translating analog fluorescence intensity to a digital signal introduces noise, decreases overall system dynamic range and adds non-linearities and variability to the gene expression data. Similarly, the mechanical and optical-to-digital interfaces in a nucleotide sequencing machine contribute noise, errors and random variability to sequence data.

Descriptive Statistics - Measures of Location
• Descriptive measure computed from sample data - statistic
• Descriptive measure computed from population data - parameter
• Most common measures of location
  – Mean
  – Median
  – Mode
  – Geometric mean

Descriptive Statistics - Arithmetic mean
• Probably most common of the measures of central tendency
  – a.k.a. ‘average’
• Definition: $\bar{x} = \frac{\sum x_i}{n}$
  – Best suited to a normal distribution, although we tend to use it regardless of distribution
• Weakness
  – Influenced by extreme values
• Translations
  – Additive: adding a constant c to every value adds c to the mean
  – Multiplicative: multiplying every value by c multiplies the mean by c

Descriptive Statistics - Median
• Frequently used if there are extreme values in a distribution or if the distribution is non-normal
• Definition
  – That value that divides the ‘ordered array’ into two equal parts
• If an odd number of observations, the median will be the (n+1)/2 observation
  – ex.: median of 11 observations is the 6th observation
• If an even number of observations, the median will be the midpoint between the middle two observations
  – ex.: median of 12 observations is the midpoint between the 6th and 7th
• Comparison of mean and median indicates skewness of distribution

Descriptive Statistics - Mode
• Not used very frequently in practice
• Definition
  – Value that occurs most frequently in data set
• If all values are different, no mode
• May be more than one mode
  – Bimodal or multimodal

Descriptive Statistics - Geometric mean
• Used to describe data with an extreme skewness to the right
  – Ex., laboratory data: lipid measurements
• Definition
  – Antilog of the mean of the $\log x_i$

Descriptive Statistics - Measures of Dispersion
• Dispersion of a set of observations is the variety exhibited by the observations
  – If all values are the same, no dispersion
  – The more the values are spread, the greater the dispersion
• Many distributions are well described by a measure of location and a measure of dispersion
• Common measures
  – Range
  – Quantiles
  – Variance
  – Standard deviation
  – Coefficient of variation

Descriptive Statistics - Range
• Range is the difference between the smallest and largest values in the data set
  – Heavily influenced by the two most extreme values and ignores the rest of the distribution

Descriptive Statistics - Variance
• Variance measures the distribution of values around their mean
• Definition of sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
• Degrees of freedom
  – n-1 is used because if we know n-1 deviations, the nth deviation is known
  – Deviations have to sum to zero

Descriptive Statistics - Standard Deviation
• Definition of sample standard deviation: $s = \sqrt{s^2}$
• Standard deviation is in the same units as the mean
  – Variance is in units squared
• Translations
  – Additive: adding a constant to every value leaves the standard deviation unchanged
  – Multiplicative: multiplying every value by c multiplies the standard deviation by |c|

Descriptive Statistics - Coefficient of Variation
• Relative variation rather than absolute variation such as standard deviation
• Definition of C.V.: $\mathrm{C.V.} = \frac{s}{\bar{x}} (100)$
• Useful in comparing variation between two distributions
  – Used particularly in comparing laboratory measures to identify those determinations with more variation
  – Also used in QC analyses for comparing observers

Sampling and distributions
• Population mean and variance are estimated by sampling population data and drawing inferences from the sample data, based in part on assumptions of how the data are distributed in the population.
• Distributions used in statistical analysis:
  – Discrete random variables: Binomial, Poisson and Hypergeometric distributions.
  – Continuous random variables: Normal distribution, Z distribution.
• Eg: The analysis of discrete random variables, such as the position of a nucleotide on a given sequence, may use techniques based on a binomial distribution and not techniques that assume a normal distribution.

Hypothesis Testing
• Hypothesis testing deals with the null hypothesis and the alternative hypothesis.
• The null hypothesis is usually assumed to hold unless there is enough evidence to reject it.
• Eg: In microarray work, a typical hypothesis is that two microarrays that have been subjected to the same spotting and hybridization process will produce identical gene expression fluorescence results. The degree to which this hypothesis is true can be estimated by examining the gene expression scatter plots created from data gleaned from each microarray and correlating the values mathematically.

Z score
• A statistic commonly used in alignment searches.
• It is a measure of the distance from the mean, measured in standard deviation units.
• If each sequence to be aligned is randomized and an optimal alignment is made, the result is a series of scores (S) for the alignment of two sequences, with a mean (µ) and standard deviation (δ).
• The Z score: $Z = (S - \mu) / \delta$

Z score
• The advantage of a Z score over a simple percentage score is that it corrects for compositional biases in the sequence and accounts for varying length of sequences.
• Z scores assume a normal distribution, whereas alignment data don’t follow a normal distribution.
• As a result, a higher Z score is taken as the threshold of significance.

Graphical Methods - Bar Graphs and Histogram
• Histogram: graph of frequencies, a special form of bar graph
  – Can be used to visually compare frequencies
  – Easier to assess magnitude of differences rather than trying to judge numbers
• Frequency polygon: similar to histogram

Summary
• In practice, descriptive statistics play a major role
  – Always the first 1-2 tables/figures in a paper
  – Statistician needs to know about each variable before deciding how to analyze to answer research questions
• In any analysis, 90% of the effort goes into setting up the data
  – Descriptive statistics are part of that 90%

Distributions in Bioinformatics
• Binomial distributions are used for spotting stretches of DNA with unusual nucleotide sequences and for pairwise sequence comparisons.
• Normal distributions are used for modeling continuous random variables, with applications such as the statistical significance of pairwise sequence comparison.
• Multinomial distributions are used for spotting stretches of DNA with unusual content, distinguishing tests for introns by composition and quantifying relative codon frequency.

Software
• Statistical software
  – SAS
  – SPSS
  – Stata
  – BMDP
  – MINITAB
  – Excel??
• Graphical software
  – From list above
  – Sigmaplot
  – Harvard Graphics
  – Axum

Case Study - Microarray
• Microarrays offer an efficient method of gathering data that can be used to determine the expression patterns of tens of thousands of genes in only a few hours.
• Microarrays allow researchers to examine the mRNA from different tissues in normal and disease states to determine which genes and environmental conditions lead to disease.

Microarray analysis
• Analysis of the fluorescence data includes a check for microarray-to-microarray variability using a scatter plot.
• Gene expression levels are measured by adequately quantifying the fluorescence associated with each spot.
• The most common method of achieving this is to rely on simple descriptive statistics such as mean, mode and median.

Microarray analysis
• The total pixel intensity is the sum of all pixels corresponding to fluorescence in an area.
• The volume measure is the sum of signal intensity above background noise for each pixel.
• The role of statistical analysis in reading the intensity value associated with each spot is to control for variability. Inter- and intra-microarray comparisons are used to identify contamination and other sources of variability.

Microarray analysis
• The mean is the average pixel density over a spot, corresponding to the average fluorescence intensity. The advantage of measuring the mean intensity level is that it decreases the error due to variance in DNA deposition during microarray work.
• The mode is the most likely intensity value, represented by the highest peak in the fluorescence plot.
• The median is the mid-point in the intensity plot.

Microarray analysis
• A quick check for data validity is to create a scatter plot of fluorescence data from two identically treated microarrays.
• (Refer to Fig. 6-4, p. 226, Bioinformatics Computing book)
• The ideal condition is when the gene expressions measured by the two microarrays are identical, as indicated by data lying on the 45-degree ID line, as in (A).

Microarray analysis
• If the amplitude of gene expression on one microarray is greater than on the other, data fall off the ID line, as in (B) and (C).
• The scatter plot also provides a measure of gene expression amplitude, in that the greater the distance from the origin, the greater the expression amplitude.
• For example, the gene plotted at position (C) has a greater expression amplitude than the gene at position (A).
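
The scatter-plot check described above can also be done numerically. The following is a rough sketch, not the method of the cited book: the function names `pearson_r` and `id_line_report`, the intensity values and the tolerance are all assumptions chosen for illustration. It correlates the values from two identically treated microarrays, measures how far each gene falls from the 45-degree ID line, and uses distance from the origin as the expression amplitude.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def id_line_report(x, y, tolerance):
    """Per gene: (distance from ID line, amplitude, flagged as off the line)."""
    report = []
    for a, b in zip(x, y):
        off_line = abs(b - a) / math.sqrt(2)  # perpendicular distance to y = x
        amplitude = math.hypot(a, b)          # distance from the origin
        report.append((off_line, amplitude, off_line > tolerance))
    return report


# Assumed fluorescence intensities for five genes on two microarrays.
array_1 = [100, 250, 400, 800, 1200]
array_2 = [105, 240, 410, 1100, 1190]

print("correlation:", pearson_r(array_1, array_2))
for gene, row in enumerate(id_line_report(array_1, array_2, tolerance=50)):
    print(gene, row)
```

The tolerance is arbitrary here; in practice it would have to be chosen from the noise level of the measurement system.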