Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 1: Introduction, Samples and Populations, Confidence Intervals
January 11th 2016
Marshall University Genomics Core Facility

Why Statistics?
• Why do the Biomedical Sciences need statistics?
  – We need to quantify the extent to which our experiments are repeatable
  – We need to be able to generalize from a sample to a population
  – We need to be able to quantify the uncertainty in our measurements

Three Aspects of Statistics
• There are three inter-related aspects of statistics that will be discussed in this course.
  – Descriptive statistics.
    • Presenting and visualizing data.
  – Inferential statistics.
    • Generalizing from samples to populations.
    • Quantifying the extent to which observations can be generalized.
  – Hypothesis Testing.
    • Quantifying the extent to which experiments support stated hypotheses.

Why do we need a scientific approach to statistics?
• Statistics is not intuitive
  – We see patterns in random data
    • (Our intuition for pattern recognition is over-developed)
  – Our intuitive sense of probability is often incorrect
    • (Our intuition for probability is under-developed)
    • We don’t combine probabilities correctly
  – We don’t appreciate that coincidences are common
    • Though specific coincidences are not
• We need a rigorous, scientific approach to overcome these challenges

We see patterns in random data
- - - X X - - X X - - - - X - - X - X X X - - - X - X X - - - X - X - - - - X X - - - - - - X - - - - - X X X X X - - X X - X X - - - - X X X - X X - - X - X - X - X X X X - X - X X - X X - - X - X - - - X X - - - X X - X X X - X X - - X - - - - X X - X - - - - X - X X X - - X X - X - X - X - X - - - X - - X X X - - - - X - X - - - - - - - - - - X X X - X - X X - X - - - - - X - X - - - X X X - X - - X X X X X X X - - X X - - - X - X - X - - X X X - - X - X X - - X - X X - X X X - - - X - X X - X X - - - - - X X X - - - X - - - X - X - - - X X - X X X X - X X X X - X X X X X X - X - - - X

Our intuition for probability is poor
• Example:
  – Two different cancers have mortality rates of 1,286 in 100,000 and 2.414 in 100
  – Which has the higher risk?
  – When surveyed, many people said the first was riskier

We don’t combine probabilities correctly
• Example:
  • Suppose the prevalence of HIV in the population is 0.1%, and an HIV test is 99% accurate.
  • If a random patient tests positive, what is the chance they have HIV?
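
One way to work through the question above is to count patients in a hypothetical population of 100,000. The sketch below assumes that "99% accurate" means the test has both a sensitivity and a specificity of 99%; the lecture does not spell this out, and the function name `prob_disease_given_positive` is ours, chosen for illustration.

```python
def prob_disease_given_positive(prevalence, accuracy, population=100_000):
    """Chance that a patient who tests positive actually has the disease.

    Assumption (not stated in the lecture): `accuracy` is used as both
    the sensitivity and the specificity of the test.
    """
    diseased = prevalence * population
    healthy = population - diseased
    true_positives = accuracy * diseased        # diseased and test positive
    false_positives = (1 - accuracy) * healthy  # healthy but test positive
    return true_positives / (true_positives + false_positives)


# The three prevalence values mentioned in the lecture.
for prevalence in (0.001, 0.01, 0.10):
    chance = prob_disease_given_positive(prevalence, accuracy=0.99)
    print(f"prevalence {prevalence:.1%}: chance of HIV given a positive test = {chance:.1%}")
```

Under these assumptions, a prevalence of 0.1% gives 99 true positives among 1,098 positive tests, so only about 9% of patients who test positive have HIV. The answer rises quickly as the prevalence increases.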
Combining Probabilities Example
• To figure this out, complete the table:

           HIV+     HIV-     Total
  Test+      99      999     1,098
  Test-       1   98,901    98,902
  Total     100   99,900   100,000

• Repeat for prevalence of 1%, and 10%

Coincidences are common
• We have all experienced coincidence
  – When these happen they seem extremely surprising
  – It is natural to want to attach significance to them
• However, think about all the possible coincidences that could happen
  – “The strange thing about coincidence is that it doesn’t happen more often”

Coincidence in experimental science
• In science, if we perform many tests, some will appear true just by chance
• Equally natural to want to attach significance in this scenario
• When performing many tests, results need to be interpreted in this context
  – This is counter-intuitive and technically complex

Challenges of statistics
• Statistics can be challenging for a number of reasons
  – Statistics is a branch of mathematics
  – It uses its own (sometimes confusing) technical jargon
  – It requires abstract thinking
  – Answers to statistical questions are phrased in terms of probability, not certainty

Statistics is a branch of math
• Gaining a full understanding of statistics requires studying complex equations
  – Some branches of statistics rely on understanding advanced mathematical topics
    • Linear algebra, calculus, and more
• However, it is possible to learn to use statistics without gaining this depth of understanding
• This course will aim to present an intuitive and practical approach while minimizing the mathematical content
• Equations will only be presented where they are expected to enhance understanding

Statistical Terminology
• Like any specialized field of study, statistics uses its own jargon
  – Words or phrases which have a different meaning in a statistical context than in a general context
  – Often (but not always) there are historical reasons why these words are used
  – Some examples:
    • Error, Significant, Sample, Population, Confidence, Distribution, Normal, Power, Model, ...

Abstract Thinking
• Statistics involves abstract thinking
  – Sometimes the logic used in statistics is (necessarily) obscure
  – A major aim of statistics is to generalize from observations about a sample to inferences about a population
    • Involves thinking about a set of observations (on the population) that cannot actually be observed
  – Providing evidence (statistical significance) to support a hypothesis starts with making the assumption that the hypothesis is false
    • We will discuss this later in the course

Statistical answers are probabilistic
• Answers to statistical questions are phrased in terms of probability, not in terms of certainties
  – We measure statistical significance by computing the probability of observing our results under certain assumed conditions
  – We estimate the precision of our measurements by expressing ranges and degrees of confidence
• These kinds of answers can be frustrating or unsatisfactory (particularly for scientists)
• The key to understanding statistical results is a solid understanding of the logic behind the techniques being used

Samples and Populations
• Inferential statistics aims to generalize from a sample to a population.
  – The sample is essentially the data you collect (or the subjects from which you collect them)
  – The population is the collection of all data (or all possible subjects) about which you want to make inferences
• Defining the population is essential to correctly interpreting statistical results. The sample must be representative of the population.
Samples and Populations for Clinical Trials
• Defining samples and populations is particularly important in the context of clinical trials. In theory, the process should be as follows:
  1. Specify the hypothesis, including the population to which it applies.
  2. Design the experiment, including a mechanism for selecting a representative sample.
  3. Collect the data.
  4. Analyze the data.
  5. Interpret the analysis.
• In practice, there are always biases in selecting the sample. It is important to recognize and acknowledge these.

Samples and Populations for Lab Experiments
• Understanding the terms "sample" and "population" in the context of laboratory experiments is more subtle.
  – The sample is the set of all experiments you perform.
  – The population is the set of all equivalent experiments you could possibly perform.
• The word "equivalent" in the above is important. Consider the following example.

Lab Experiment Example
• We want to measure expression levels of a particular gene in a tumor in a mouse model, using real-time qPCR. We have a technique for inducing tumor growth in a particular strain of lab mouse, and we will perform the experiment in triplicate. What is the population in each of the following experimental designs?
  1. We remove a tumor from a mouse, extract RNA, divide into three aliquots, and perform PCR on each
  2. We remove a tumor from a mouse, divide the tumor in three, extract RNA from each portion, and perform PCR on each
  3. We remove three tumors from a mouse, extract RNA from each tumor, and perform PCR on each
  4. We remove a tumor from each of three mice, extract RNA from each tumor, and perform PCR on each
• Note how the details of the experimental design dramatically affect the conclusions that can be drawn

Confidence Intervals

From Samples to Populations
• Recap:
  – We perform an experiment, collect data, or make measurements from a sample
  – We want to be able to generalize these measurements to a population
  – A confidence interval is the key mechanism for making this generalization.

Example: data expressed as a proportion
• In a recent study of cardiovascular disease in West Virginia, 89 overweight (25<BMI≤30) subjects who had no history of cardiovascular disease were genotyped for the SNP rs5880 in the CETP gene. Of these, 75 (84%) were homozygous (GG) while 14 (16%) were heterozygous (CG).
  – We would like to know the proportion of our population who are homozygous (GG) at this locus.
  – What is the population?
  – Given the data, what is our best estimate of the proportion who are homozygous-GG at this locus?
  – How confident are we this is representative of the true proportion?
(West Virginia Medical Journal, Vol 108, January 2012)

Confidence Interval for Proportion Data
• The population is the set of all West Virginia residents who are overweight (25<BMI≤30) and who have no history of cardiovascular disease (according to the criteria in the publication).
• From our sample data, the best estimate is that 84% of this population are homozygous (GG) at the rs5880 locus.
• To express our confidence in this estimate, we quote a confidence interval: We are 95% confident that the range 75.2% to 90.6% contains the true proportion of homozygous (GG) individuals in this population.

Understanding Confidence Intervals
• The logic of a confidence interval is often subtly misunderstood.
• It is not correct to say "There is a 95% chance the true population value lies between 75.2% and 90.6%".
  – This implies the true population value is subject to random fluctuations.
  – It is the confidence interval that is subject to random fluctuations.
• It's better to say "There is a 95% chance that the interval [75.2%, 90.6%] includes the true population value."

A statistical experiment
• This statistical experiment may clarify how confidence intervals work.
• Take a bag with 25 red balls and 75 black balls. We know that the true proportion of red balls in the bag is 25%, but we will try to estimate this by sampling (and computing a confidence interval).
  – Draw 15 balls from the bag.
  – Compute the proportion of those that are red.
  – Calculate the 95% confidence interval for that proportion.
  – Replace the balls, and repeat 40 times.
• So we end up with 40 confidence intervals

Confidence Interval Experiment Results
• Each time we compute a 95% confidence interval, it has a 95% chance of containing the true population value.
• Since we computed 40, we'd expect, on average, 38 of these to contain the true population value.
• In this case, we actually know the true population value is 0.25, so we can check.
  – It turns out, in this example, that only 37 of the 40 (92.5%) contain the true value.
• The more repeats you do, the closer you'll get to 95% of the intervals containing the true value.
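
The bag experiment can be imitated in a few lines of code. This is a rough sketch and not the procedure used to produce the lecture's results: the lecture does not say which interval formula was applied to the 40 samples, so the code assumes the modified Wald formula exactly as it is written at the end of this lecture, with z = 1.96. The function names and the seed value are ours.

```python
import math
import random


def modified_wald_interval(successes, n, z=1.96):
    """Interval from the modified Wald formula as written in this lecture."""
    p_adj = (successes + z) / (n + z**2)
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z**2))
    return p_adj - margin, p_adj + margin


def run_experiment(repeats=40, draws=15, red=25, black=75, seed=None):
    """Count how many of `repeats` intervals contain the true proportion."""
    rng = random.Random(seed)
    bag = ["red"] * red + ["black"] * black
    true_proportion = red / (red + black)
    hits = 0
    for _ in range(repeats):
        # Draw without replacement; the bag is refilled for the next repeat.
        sample = rng.sample(bag, draws)
        low, high = modified_wald_interval(sample.count("red"), draws)
        if low <= true_proportion <= high:
            hits += 1
    return hits


# The seed is arbitrary; a different seed gives a different count.
print(run_experiment(repeats=40, seed=617), "of 40 intervals contain 0.25")
print(run_experiment(repeats=10_000, seed=617) / 10_000, "coverage over 10,000 repeats")
```

Because the sample is small and the interval formula is approximate, the long-run fraction of intervals that contain 0.25 need not be exactly 95% in this sketch.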
Different Levels of Confidence
• What's special about 95%?
  – Nothing at all!
  – Commonly used only by tradition.
  – Could just as well compute 90% confidence intervals, 99% confidence intervals, or any other value up to 100%
• Would the 90% confidence interval for our genotype data be wider or narrower than the 95% confidence interval?
• What would the 100% confidence interval be for these data?
  – Is this helpful?

Confidence Intervals: Assumptions
• Formulae for calculating confidence intervals (coming soon) only work if certain conditions are true. These are the assumptions in the formulae.
  – The sample is a random (or representative) sample of the population.
  – The observations in the sample are independent of each other.
    • What could violate this in our genotype study?
  – The measurements are accurate.

Violating assumptions
• In practice, these assumptions rarely hold exactly.
  – Important to minimize deviation from the assumptions.
  – Acknowledge any violations.
  – Violating the assumptions will make the confidence intervals too optimistic.
    • Too narrow.

Computing confidence intervals for proportions
• Computing confidence intervals for proportions is technically complex.
• No general agreement on the best way to do this.
• Motulsky outlines a number of methods (pages 32-35).
• Modified Wald Method is probably preferred.

Modified Wald Method for Confidence Intervals for Proportions
• Suppose the sample size is n and the number of members of the sample that fall into the category of interest (these are traditionally termed successes) is S. (In our genotype example, n=89 and S=75.)
  1. Determine z from the confidence level you want. This comes from the Normal Distribution, which we'll discuss later. Technically, if you want a confidence level of α, then z is the value such that the area under the graph of the normal distribution between -z and z is α. Equivalently, it's the value of z such that the area under the graph from 0 to z is α/2. (If you use tables to determine this, it may be listed as the value for α/2 or the value for (1+α)/2, depending on how the table is presented.) For a 95% confidence interval, z=1.96.
  2. Compute p’ = (S+z)/(n+z²).
  3. Compute the margin of error, W = z √(p’(1-p’)/(n+z²)).
  4. The confidence interval runs from (p’-W) to (p’+W).
• Note that this interval is not symmetrical around the observed proportion S/n, because it is centered on p’ instead. This is particularly noticeable if the proportion is close to 0% or 100%.
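
The four steps can be sketched in code as follows. This is an illustration under stated assumptions, not a reference implementation: it follows the formula for p’ exactly as written above, which adds z to S. The commonly cited Agresti-Coull form of this adjustment adds z²/2 instead, which for z = 1.96 is about 1.92, so the two differ only slightly at the 95% level. The function name `modified_wald_ci` is ours, and z is taken from Python's standard-library `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist


def modified_wald_ci(successes, n, confidence=0.95):
    """Confidence interval for a proportion, following the four steps above."""
    # Step 1: z such that the area between -z and z equals the confidence level.
    z = NormalDist().inv_cdf((1 + confidence) / 2)
    # Step 2: adjusted proportion p', as written in the lecture (S + z).
    p_adj = (successes + z) / (n + z**2)
    # Step 3: margin of error W.
    margin = z * math.sqrt(p_adj * (1 - p_adj) / (n + z**2))
    # Step 4: the interval runs from p' - W to p' + W.
    return p_adj - margin, p_adj + margin


# Genotype example from the lecture: n = 89 subjects, S = 75 homozygous (GG).
low, high = modified_wald_ci(75, 89)
print(f"95% CI: {low:.1%} to {high:.1%}")

# Other confidence levels only change z.
low90, high90 = modified_wald_ci(75, 89, confidence=0.90)
print(f"90% CI: {low90:.1%} to {high90:.1%}")
```

For the genotype data, the 95% call gives approximately 75.2% to 90.6%, matching the interval quoted earlier in the lecture.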