Marshall University School of Medicine
Department of Biochemistry and Microbiology
BMS 617
Lecture 1: Introduction, Samples and
Populations, Confidence Intervals
January 11th 2016
Marshall University Genomics Core Facility
Why Statistics?
• Why do the Biomedical Sciences need
statistics?
– We need to quantify the extent to which our
experiments are repeatable
– We need to be able to generalize from a sample to
a population
– We need to be able to quantify the uncertainty in
our measurements
Three Aspects of Statistics
• There are three inter-related aspects of statistics
that will be discussed in this course.
– Descriptive statistics.
• Presenting and visualizing data.
– Inferential statistics.
• Generalizing from samples to populations.
• Quantifying the extent to which observations can be
generalized.
– Hypothesis Testing.
• Quantifying the extent to which experiments support stated
hypotheses.
Why do we need a scientific approach
to statistics?
• Statistics is not intuitive
– We see patterns in random data
• (Our intuition for pattern recognition is over-developed)
– Our intuitive sense of probability is often incorrect
• (Our intuition for probability is under-developed)
• We don’t combine probabilities correctly
– We don’t appreciate that coincidences are common
• Though specific coincidences are not
• We need a rigorous, scientific approach to
overcome these challenges
We see patterns in random data
- - - X X - - X X - - - - X - - X - X X X - - - X - X X - -
- X - X - - - - X X - - - - - - X - - - - - X X X X X - - X
X - X X - - - - X X X - X X - - X - X - X - X X X X - X - X
X - X X - - X - X - - - X X - - - X X - X X X - X X - - X -
- - - X X - X - - - - X - X X X - - X X - X - X - X - X -
- - X - - X X X - - - - X - X - - - - - - - - - - X X X - X
- X X - X - - - - - X - X - - - X X X - X - - X X X X X X X
- - X X - - - X - X - X - - X X X - - X - X X - - X - X X
- X X X - - - X - X X - X X - - - - - X X X - - - X - - - X
- X - - - X X - X X X X - X X X X - X X X X X X - X - - - X
Our intuition for probability is poor
• Example:
– Two different cancers have mortality rates of
1,286 in 100,000 and 2.414 in 100
– Which has the higher risk?
– When surveyed, many people said the first was
riskier
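Putting both rates on the same scale makes the comparison straightforward. A minimal sketch of that arithmetic (the variable names are illustrative only):

```python
# Express both mortality rates as plain proportions so they share a denominator.
rate_first = 1286 / 100_000   # 1,286 in 100,000
rate_second = 2.414 / 100     # 2.414 in 100

print(f"First cancer:  {rate_first:.3%}")
print(f"Second cancer: {rate_second:.3%}")
print("Higher risk:", "second" if rate_second > rate_first else "first")
```

The larger raw count (1,286) belongs to the lower rate, which is likely why intuition is misled.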
We don’t combine probabilities
correctly
• Example:
• Suppose the prevalence of HIV in the
population is 0.1%, and an HIV test is 99%
accurate.
• If a random patient tests positive, what is the
chance they have HIV?
Combining Probabilities Example
• To figure this out, complete the table:
          HIV+     HIV-      Total
Test+       99      999      1,098
Test-        1   98,901     98,902
Total      100   99,900    100,000
• Repeat for prevalence of 1%, and 10%
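The table can be filled in programmatically. The sketch below assumes that "99% accurate" means the test is correct for 99% of infected people and for 99% of uninfected people (this is what the counts in the table imply), and that the population is 100,000. The function name and parameters are illustrative, not part of the lecture.

```python
def screening_table(prevalence, accuracy=0.99, population=100_000):
    """Return the 2x2 counts and the chance that a positive test is a true positive.

    Assumption: `accuracy` applies equally to infected and uninfected people.
    """
    infected = population * prevalence
    uninfected = population - infected
    true_pos = infected * accuracy        # infected, test positive
    false_neg = infected - true_pos       # infected, test negative
    true_neg = uninfected * accuracy      # uninfected, test negative
    false_pos = uninfected - true_neg     # uninfected, test positive
    chance_infected = true_pos / (true_pos + false_pos)
    counts = {
        "Test+": (true_pos, false_pos),
        "Test-": (false_neg, true_neg),
    }
    return counts, chance_infected


for prevalence in (0.001, 0.01, 0.10):
    counts, chance = screening_table(prevalence)
    print(f"Prevalence {prevalence:.1%}: "
          f"P(HIV+ | Test+) = {chance:.1%}")
```

With a prevalence of 0.1%, only 99 of the 1,098 positive results come from infected patients, so a positive test corresponds to roughly a 9% chance of infection.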
Coincidences are common
• We have all experienced coincidence
– When these happen they seem extremely
surprising
– It is natural to want to attach significance to them
• However, think about all the possible
coincidences that could happen
– “The strange thing about coincidence is that it
doesn’t happen more often”
Coincidence in experimental science
• In science, if we perform many tests, some will
appear true just by chance
• Equally natural to want to attach significance
in this scenario
• When performing many tests, results need to
be interpreted in this context
– This is counter-intuitive and technically complex
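To get a feel for how quickly chance findings accumulate, the sketch below computes the probability that at least one of several tests appears positive purely by chance. It assumes, for illustration only, that the tests are independent and that each has a 5% chance of a false positive; neither number comes from the lecture.

```python
def chance_of_any_false_positive(num_tests, per_test_rate=0.05):
    """Probability that at least one of `num_tests` independent tests
    gives a false positive, when each does so with `per_test_rate`."""
    return 1 - (1 - per_test_rate) ** num_tests


for num_tests in (1, 10, 20, 100):
    chance = chance_of_any_false_positive(num_tests)
    print(f"{num_tests:>3} tests: {chance:.1%} chance of at least one false positive")
```

Under these assumptions, 20 tests already give roughly a 64% chance of at least one spurious result.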
Challenges of statistics
• Statistics can be challenging for a number of
reasons
– Statistics is a branch of mathematics
– It uses its own - sometimes confusing - technical
jargon
– It requires abstract thinking
– Answers to statistical questions are phrased in
terms of probability, not certainty
Statistics is a branch of math
• Gaining a full understanding of statistics requires
studying complex equations
– Some branches of statistics rely on understanding
advanced mathematical topics
• Linear algebra, calculus, and more
• However, it is possible to learn to use statistics without
gaining this depth of understanding
• This course will aim to present an intuitive and
practical approach while minimizing the mathematical
content
• Equations will only be presented where they are
expected to enhance understanding
Statistical Terminology
• Like any specialized field of study, statistics
uses its own jargon
– Words or phrases which have a different meaning
in a statistical context than in a general context
– Often (but not always) there are historical reasons
why these words are used
– Some examples:
• Error, Significant, Sample, Population, Confidence,
Distribution, Normal, Power, Model, ...
Abstract Thinking
• Statistics involves abstract thinking
– Sometimes the logic used in statistics is (necessarily)
obscure
– A major aim of statistics is to generalize from
observations about a sample to inferences about a
population
• Involves thinking about a set of observations (on the
population) that cannot actually be observed
– Providing evidence (statistical significance) to support
a hypothesis starts with making the assumption that
the hypothesis is false
• We will discuss this later in the course
Statistical answers are probabilistic
• Answers to statistical questions are phrased in terms of
probability, not in terms of certainties
– We measure statistical significance by computing the
probability of observing our results under certain assumed
conditions
– We estimate the precision of our measurements by
expressing ranges and degrees of confidence
• These kinds of answers can be frustrating or
unsatisfactory (particularly for scientists)
• The key to understanding statistical results is a solid
understanding of the logic behind the techniques being
used
Samples and Populations
• Inferential statistics aims to generalize from a
sample to a population.
– The sample is essentially the data you collect (or the
subjects from which you collect them)
– The population is the collection of all data (or all
possible subjects) about which you want to make
inferences
• Defining the population is essential to correctly
interpreting statistical results. The sample must
be representative of the population.
Samples and Populations for Clinical
Trials
• Defining samples and populations is particularly important
in the context of clinical trials. In theory, the process should
be as follows:
1. Specify the hypothesis, including the population to which it
   applies.
2. Design the experiment, including a mechanism for selecting a
   representative sample.
3. Collect the data.
4. Analyze the data.
5. Interpret the analysis.
• In practice, there are always biases in selecting the sample.
It is important to recognize and acknowledge these biases.
Samples and Populations for Lab
Experiments
• Understanding the terms "sample" and
"population" in the context of laboratory
experiments is more subtle.
– The sample is the set of all experiments you
perform.
– The population is the set of all equivalent
experiments you could possibly perform.
• The word "equivalent" in the above is
important. Consider the following example.
Lab Experiment Example
• We want to measure expression levels of a particular gene in a
tumor in a mouse model, using real-time qPCR. We have a
technique for inducing tumor growth in a particular strain of lab
mouse, and we will perform the experiment in triplicate. What is
the population in each of the following experimental designs?
1. We remove a tumor from a mouse, extract RNA, divide into three
   aliquots, and perform PCR on each.
2. We remove a tumor from a mouse, divide the tumor in three,
   extract RNA from each portion, and perform PCR on each.
3. We remove three tumors from a mouse, extract RNA from each
   tumor, and perform PCR on each.
4. We remove a tumor from each of three mice, extract RNA from each
   tumor, and perform PCR on each.
• Note how the details of the experimental design dramatically
affect the conclusions that can be drawn.
Confidence Intervals
From Samples to Populations
• Recap:
– We perform an experiment, collect data, or make
measurements from a sample
– We want to be able to generalize these
measurements to a population
– A confidence interval is the key mechanism for
making this generalization.
Example: data expressed as a
proportion
• In a recent study of cardiovascular disease in West Virginia,
89 overweight (25<BMI≤30) subjects who had no history of
cardiovascular disease were genotyped for the SNP rs5880
in the CETP gene. Of these, 75 (84%) were homozygous
(GG) while 14 (16%) were heterozygous (CG).
– We would like to know the proportion of our population who
are homozygous (GG) at this locus.
– What is the population?
– Given the data, what is our best estimate of the proportion who
are homozygous-GG at this locus?
– How confident are we this is representative of the true
proportion?
(West Virginia Medical Journal, Vol 108, January 2012)
Confidence Interval for Proportion
Data
• The population is the set of all West Virginia residents
who are overweight (25<BMI≤30) and who have no
history of cardiovascular disease (according to the
criteria in the publication).
• From our sample data, the best estimate is that 84% of
this population are homozygous (GG) at the rs5880
locus.
• To express our confidence in this estimate, we quote a
confidence interval:
We are 95% confident that the range 75.2% to 90.6%
contains the true proportion of homozygous (GG)
individuals in this population.
Understanding Confidence Intervals
• The logic of a confidence interval is often subtly
misunderstood.
• It is not correct to say "There is a 95% chance the true
population value lies between 75.2% and 90.6%".
– This implies the true population value is subject to random
fluctuations.
– It is the confidence interval that is subject to random
fluctuations.
• It's better to say "There is a 95% chance that the
interval [75.2%, 90.6%] includes the true population
value."
A statistical experiment
• This statistical experiment may clarify how confidence
intervals work.
• Take a bag with 25 red balls and 75 black balls. We
know that the true proportion of red balls in the bag is
25%, but we will try to estimate this by sampling (and
computing a confidence interval).
– Draw 15 balls from the bag.
– Compute the proportion of those that are red.
– Calculate the 95% confidence interval for that proportion.
– Replace the balls, and repeat 40 times.
• So we end up with 40 confidence intervals
Confidence Interval Experiment
Results
Confidence Interval Experiment
Results (continued)
• Each time we compute a 95% confidence interval,
it has a 95% chance of containing the true
population value.
• Since we computed 40, we'd expect, on average,
38 of these to contain the true population value.
• In this case, we actually know the true population
value is 0.25, so we can check.
– It turns out - in this example - that only 37 of the 40
(92.5%) contain the true value.
• The more repeats you do, the closer you'll get to
95% of the intervals containing the true value.
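The ball-drawing experiment is easy to simulate. The sketch below is one possible version, not the code used to produce the lecture's results: it assumes the intervals are computed with the modified Wald (Agresti-Coull) formula, draws each sample of 15 without replacement, and uses a fixed seed so runs are reproducible. All function names are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def modified_wald_ci(successes, n, level=0.95):
    """Modified Wald (Agresti-Coull) confidence interval for a proportion."""
    z = NormalDist().inv_cdf((1 + level) / 2)   # about 1.96 for 95%
    n_adj = n + z**2
    p_adj = (successes + z**2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - margin, p_adj + margin


def fraction_containing_truth(repeats, draws=15, red=25, black=75, seed=0):
    """Repeat the experiment and report how often the interval contains
    the true proportion of red balls."""
    rng = random.Random(seed)
    bag = ["red"] * red + ["black"] * black
    true_proportion = red / (red + black)
    hits = 0
    for _ in range(repeats):
        # sample() leaves the bag unchanged, so every repeat starts from a full bag
        sample = rng.sample(bag, draws)
        low, high = modified_wald_ci(sample.count("red"), draws)
        hits += low <= true_proportion <= high
    return hits / repeats


for repeats in (40, 1_000, 10_000):
    print(f"{repeats:>6} repeats: "
          f"{fraction_containing_truth(repeats):.1%} of intervals contain 0.25")
```

With only 40 repeats the fraction varies noticeably from run to run. With many repeats it settles down, although for a sample as small as 15 the long-run fraction can differ somewhat from the nominal 95% because the interval formula is an approximation.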
Different Levels of Confidence
• What's special about 95%?
– Nothing at all!
– Commonly used only by tradition.
– Could just as well compute 90% confidence intervals,
99% confidence intervals, or any other value up to
100%
• Would the 90% confidence interval for our genotype data be
wider or narrower than the 95% confidence interval?
• What would the 100% confidence interval be for these
data?
– Is this helpful?
Confidence Intervals: Assumptions
• Formulae for calculating confidence intervals
(coming soon) only work if certain conditions
are true. These are the assumptions in the
formulae.
– The sample is a random (or representative) sample of the
population.
– The observations in the sample are independent
of each other.
• What could violate this in our genotype study?
– The measurements are accurate.
Violating assumptions
• In practice, these assumptions rarely hold
exactly.
– Important to minimize deviation from the
assumptions.
– Acknowledge any violations.
– Violating the assumptions will make the
confidence intervals too optimistic.
• Too narrow.
Computing confidence intervals for
proportions
• Computing confidence intervals for
proportions is technically complex.
• No general agreement on the best way to do
this.
• Motulsky outlines a number of methods (pages
32-35).
• The modified Wald method is probably preferred.
Modified Wald Method for Confidence
Intervals for Proportions
• Suppose the sample size is n and the number of members of the sample
that fall into the category of interest (these are traditionally termed
successes) is S. (In our genotype example, n=89 and S=75.)
1. Determine z from the confidence level you want. This comes from the
   Normal Distribution, which we'll discuss later. Technically, if you want a
   confidence level of α, then z is the value such that the area under the graph
   of the normal distribution between -z and z is α. Equivalently, it's the value
   of z such that the area under the graph from 0 to z is α/2. (If you use tables
   to determine this, it may be listed as the value for α/2 or the value for
   (1+α)/2, depending on how the table is presented.) For a 95% confidence
   interval, z=1.96.
2. Compute $p' = (S + z^2/2)/(n + z^2)$.
3. Compute the margin of error, $W = z\sqrt{p'(1-p')/(n + z^2)}$.
4. The confidence interval runs from $p' - W$ to $p' + W$.
• Note that this interval is not symmetrical around the estimated
proportion S/n. This is particularly noticeable if the proportion is close to 0%
or 100%.
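The four steps can be sketched in a few lines of Python. This is a minimal illustration in the Agresti-Coull form of the modified Wald method, which adds z²/2 to the number of successes and z² to the sample size. The function name is hypothetical, and `statistics.NormalDist` from the standard library stands in for a table lookup of z.

```python
from math import sqrt
from statistics import NormalDist


def modified_wald_ci(successes, n, level=0.95):
    """Modified Wald (Agresti-Coull) confidence interval for a proportion."""
    # Step 1: z such that the area between -z and z equals the confidence level
    z = NormalDist().inv_cdf((1 + level) / 2)
    # Step 2: adjusted proportion
    p_adj = (successes + z**2 / 2) / (n + z**2)
    # Step 3: margin of error
    margin = z * sqrt(p_adj * (1 - p_adj) / (n + z**2))
    # Step 4: the interval is centered on the adjusted proportion
    return p_adj - margin, p_adj + margin


# Genotype example: n = 89 subjects, S = 75 homozygous (GG)
for level in (0.90, 0.95, 0.99):
    low, high = modified_wald_ci(75, 89, level)
    print(f"{level:.0%} confidence interval: {low:.1%} to {high:.1%}")
```

For the genotype data the 95% interval comes out at roughly 75% to 91%. Running the loop also shows how the interval widens as the confidence level increases.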