Probability and
Statistical Inference
Gehlbach: Chapter 8
Objective of Statistical Analysis
• To answer research questions using observed data, using data reduction and analyzing variability
• To make an inference about a population based on information contained in a sample from that population
• To provide an associated measure of how good the inference is
Basic Concepts of Statistics
[Diagram: the population (described by parameters θ) and a sample (observed values X) are linked by two arrows:]
• Sampling (population → sample): the method of collecting data, based on probability
• Estimation & Inference (sample → population): the method of testing hypotheses, based on statistics f(X)
General Approach to Statistical Analysis
[Flow diagram:]
Population distribution (random variables; parameters μ, σ)
→ Sampling (samples of size N generate data)
→ Descriptive statistics (figures, tables); Estimation (statistics X̄, SD)
→ Statistical tests of hypothesis
→ Inference about the population
Outline
• Probability
  • Definition
  • Probability Laws
  • Random Variable
  • Probability Distributions
• Statistical Inference
  • Definition
  • Sample vs. Population
  • Sampling Variability
  • Sampling Problems
  • Central Limit Theorem
  • Hypothesis Testing
  • Test Statistics
  • P-value Calculation
  • Errors in Inference
  • P-value Adjustments
  • Confidence Intervals
We disagree with Stephen
• A working understanding of P-values is not difficult to come by.
• For the most part, statistics and clinical research can work well together.
• Good collaborations result when researchers have some knowledge of design and analysis issues.
Probability
Probability and the P-value
• You need to understand what a P-value
means
• P-value represents a probabilistic
statement
• Need to understand concept of
probability distributions
• More on P-values later
Definition of Probability
• An experiment is any process by which an
observation is made
• An event (E or Ei) is any outcome of an
experiment
• The sample space (S) is the set of all possible
outcomes of an experiment
• Probability: a measure based on the sample space S; in the simplest case it is empirically estimated by
(# times event occurs) / (total # trials)
E.g.: Pr(of a red car) = (# red cars seen) / (total # cars)
• Probability is the basis for statistical inference
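The empirical estimate above can be sketched in a few lines of Python; the 30% share of red cars is a hypothetical value chosen purely for illustration:

```python
import random

def empirical_probability(trials, is_event):
    """Estimate P(event) as (# times event occurs) / (total # trials)."""
    hits = sum(1 for _ in range(trials) if is_event())
    return hits / trials

# Hypothetical example: suppose 30% of cars passing by are red.
random.seed(0)
p_red = empirical_probability(100_000, lambda: random.random() < 0.30)
print(round(p_red, 2))  # close to 0.30
```

With many trials the estimate converges on the true probability (the law of large numbers), which is what makes the relative-frequency definition usable in practice.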
Axiomatic Probability
(laying down “the laws”)
For any sample space S containing events E1,
E2, E3,…; we assign a number, P(Ei), called
the probability of Ei such that:
1. 0 ≤ P(Ei) ≤ 1
2. P(S) = 1
3. If E1, E2, E3, … are pairwise mutually exclusive events in S, then
   P(E1 ∪ E2 ∪ E3 ∪ …) = Σi P(Ei)
Union and Intersection: Venn Diagrams
Union of E1 and E2: “E1 or E2”, denoted E1 ∪ E2.
Intersection of E1 and E2: “E1 and E2”, denoted E1 ∩ E2.
[Venn diagrams: two overlapping circles E1 and E2; the union shades both circles entirely, the intersection shades only the overlap.]
Laws of Probability
(the sequel)
• Let Ē (“E complement”) be the set of events in S not in E; then P(Ē) = 1 − P(E)
• P(E1 ∪ E2) = P(E1) + P(E2) − P(E1 ∩ E2)
• The conditional probability of E1 given E2 has occurred:
  P(E1 | E2) = P(E1 ∩ E2) / P(E2)
• Events E1 and E2 are independent if
P(E1∩E2) = P(E1)P(E2)
Conditional Probability
• Restrict yourself to a “subspace” of the
sample space
              Male   Female
Infection      20%     10%
No infection   35%     35%
● P(I|M) = P(I∩M)/P(M) = 0.2/0.55 = 0.36
● P(M|I) = P(I∩M)/P(I) = 0.2/0.3 = 0.67
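The two conditional probabilities can be checked directly from the joint percentages in the table; a minimal Python sketch:

```python
# Joint probabilities from the slide's 2x2 table (infection status x sex).
joint = {("infection", "male"): 0.20, ("infection", "female"): 0.10,
         ("no infection", "male"): 0.35, ("no infection", "female"): 0.35}

def marginal(which, value):
    """Sum the joint probabilities over the other variable."""
    idx = 0 if which == "row" else 1
    return sum(p for k, p in joint.items() if k[idx] == value)

p_m = marginal("col", "male")             # P(M) = 0.20 + 0.35 = 0.55
p_i = marginal("row", "infection")        # P(I) = 0.20 + 0.10 = 0.30
p_i_and_m = joint[("infection", "male")]  # P(I ∩ M) = 0.20

print(round(p_i_and_m / p_m, 2))  # P(I|M) = 0.36
print(round(p_i_and_m / p_i, 2))  # P(M|I) = 0.67
```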
Conditional probability examples
• Categorical data analysis:
odds ratio = ratio of odds of two conditional
probabilities
• Survival analysis, conditional probabilities of the form:
  P(alive at time t1+t2 | survived to t1)
Random Variables
(where the math begins)
• A random variable is a (set) function with domain S and range ℝ (i.e., a real-valued function defined over a sample space)
• E.g.: tossing a coin, let X = 1 if heads, X = 0 if tails
  – P(X=0) = P(X=1) = ½
– Many times the random variable of interest
will be the realized value of the experiment
(e.g., if X is the b-segment PSV from RDS)
– Random variables have probability
distributions
Probability Distributions
Two types:
• Discrete distributions (and discrete random variables) are represented by a finite (or countable) number of values:
  P(X = x) = p(x)
• Continuous distributions (and continuous random variables) are represented by a real-valued interval:
  P(x1 < X < x2) = F(x2) − F(x1)
Expected Value & Variance
• Random variables are typically described
using two quantities:
– Expected value = E(X) (the mean, usually “μ”)
– Variance = V(X) (usually “σ2”)
• Discrete Case:
  E(X) = Σi xi p(xi)
  V(X) = Σi [xi − E(X)]² p(xi)
• Continuous Case:
  E(X) = ∫−∞..∞ x f(x) dx
  V(X) = ∫−∞..∞ (x − μ)² f(x) dx
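The discrete-case formulas can be sketched directly in Python, applied to the fair-coin random variable from the earlier slide (X = 1 for heads, 0 for tails):

```python
def expected_value(pmf):
    """E(X) = sum over i of x_i * p(x_i)."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """V(X) = sum over i of (x_i - E(X))^2 * p(x_i)."""
    mu = expected_value(pmf)
    return sum((x - mu) ** 2 * p for x, p in pmf.items())

coin = {0: 0.5, 1: 0.5}  # pmf of the fair-coin random variable
print(expected_value(coin))  # 0.5
print(variance(coin))        # 0.25
```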
Discrete Distribution Example
Binomial:
– Experiment consists of n identical trials
– Each trial has only 2 outcomes: success (S) or failure (F)
– P(S) = p for a single trial; P(F) = 1 − p = q
– Trials are independent
– R.V. X = the number of successes in n trials

p(x) = (n choose x) pˣ (1 − p)ⁿ⁻ˣ
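A minimal sketch of the binomial pmf; the example values (n = 10 trials, p = 0.5, x = 3 successes) are assumptions chosen for illustration:

```python
from math import comb

def binomial_pmf(x, n, p):
    """p(x) = C(n, x) * p^x * (1 - p)^(n - x)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

# Hypothetical example: P(exactly 3 successes in 10 trials with p = 0.5).
print(round(binomial_pmf(3, 10, 0.5), 4))  # 0.1172
```

As a sanity check, the probabilities over x = 0, …, n sum to 1, as required by the probability axioms.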
Continuous Distribution Example
Normal (Gaussian):
• The normal distribution is defined by its probability density function, which is given as

  f(x) = (1 / √(2πσ²)) exp(−(x − μ)² / (2σ²)),  −∞ < x < ∞

for parameters μ and σ, where σ > 0.
X ~ N(μ, σ2), E(X) = μ and V(X) = σ2
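The density can be evaluated directly from the formula; a minimal Python sketch:

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """f(x) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))."""
    return exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / sqrt(2 * pi * sigma ** 2)

# The density peaks at the mean and is symmetric around it:
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
```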
Same Variance, Different Means
[Figure: two normal density curves f(x) with equal variance σ² and different means μ1 < μ2: X ~ N(μ1, σ²) and X ~ N(μ2, σ²).]
Same Mean, Different Variances
[Figure: two normal density curves centred at the same mean μ with different variances: X ~ N(μ, σ1²) and X ~ N(μ, σ2²).]
Statistical
Inference
Statistical Inference
• Is there a difference in the population?
• You do not know about the population. Just the
sample you collected.
• Develop a Probability model
• Infer characteristics of a population from a
sample
• How likely is it that the sample data support the null hypothesis?
Statistical Inference
[Diagram: a sample (mean = 16.2) is drawn from a population (mean = ?); inference runs from the sample back to the population.]
Definition of Inference
• Infer a conclusion/estimate about a
population based on a sample from the
population
• If you collect data from whole population
you don’t need to infer anything
• Inference = conducting hypothesis tests
(for p-values), estimating 95% CI’s
Sample vs. Population (example)
• “The primary sample [involved] students in the 3rd
through 5th grades in a community bordering a major
urban center in North Carolina… The sampling frame for
the study was all third through fifth-grade students
attending the seven public elementary schools in the
community (n=2,033). From the sampling frame, school
district evaluation staff generated a random sample of
700 students.”
Source: Bowen, NK. (2006) Psychometric properties of
Elementary School Success Profile for Children. Social
Work Research, 30(1), p. 53.
Philosophy of Science
• Idea: We posit a paradigm and attempt to
falsify that paradigm.
• Science progresses faster via attempting to
falsify a paradigm than attempting to
corroborate a paradigm.
(Thomas S. Kuhn. 1970. The Structure of
Scientific Revolutions. University of Chicago
Press.)
Philosophy of Science
• Easier to collect evidence to contradict something than to prove truth?
• The fastest way to progress in science under a paradigm of falsification is through perturbation experiments.
• In epidemiology,
  – often unable to do perturbation experiments
  – it becomes a process of accumulating evidence
• Statistical testing provides a rigorous, data-driven framework for falsifying hypotheses
What is Statistical Inference?
• A generalization made about a larger
group or population from the study of a
sample of that population.
• Sampling variability: repeat your study
(sample) over and over again. Results
from each sample would be different.
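Sampling variability can be demonstrated with a small simulation; the population values below (mean 16.5, SD 3) and the sample size of 50 are hypothetical choices for illustration:

```python
import random
import statistics

# Hypothetical population; each repeated sample of 50 gives a
# different sample mean, even though the population never changes.
random.seed(2)
population = [random.gauss(16.5, 3.0) for _ in range(10_000)]

sample_means = [statistics.mean(random.sample(population, 50))
                for _ in range(5)]
print([round(m, 1) for m in sample_means])  # five different means near 16.5
```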
Sampling Variability
[Diagram, shown twice with different samples: one sample from the population gives mean = 16.2, a second sample from the same population gives mean = 17.1; the population mean remains unknown.]
Sampling Problems
• Low Response Rate
• Refusals to Participate
• Attrition
Low Response Rate
• Response rate = % of targeted sample
that supply requested information
• Statistical inferences extend only to
individuals who are similar to completers
• Low response rate ≠ Nonresponse bias,
but is a possible symptom
Low Response Rate (examples)
• “One hundred six of the 360 questionnaires were returned, a response rate of 29%.”
  Source: Nordquist, G. (2006) Patient insurance status and do-not-resuscitate orders: Survival of the richest? Journal of Sociology & Social Welfare, 33(1), p. 81.
• “At the 7th week, we sent a follow-up letter to thank the respondents and to remind the nonrespondents to complete and return their questionnaires. The follow-up letter generated 66 additional usable responses.”
  Source: Zhao JJ, Truell AD, Alexander MW, Hill IB. (2006) Less success than meets the eye? The impact of Master of Business Administration education on graduates’ careers. Journal of Education for Business, 81(5), p. 263.
• “The response rate, however, was below our expectation. We used 2 procedures to explore issues related to non-response bias. First, there were several identical items that we used in both the onsite and mailback surveys. We compared the responses of the non-respondents to those of respondents for [both surveys]. No significant differences between respondents and non-respondents were observed. We then conducted a follow-up telephone survey of non-respondents to test for potential non-response bias as well as to explore reasons why they had not returned their survey instruments…”
  Source: Kyle GT, Mowen AJ, Absher JD, Havitz ME. (2006) Commitment to public leisure service providers: A conceptual and psychometric analysis. Journal of Leisure Research, 38(1), 86-87.
Refusals to Participate
• Similar kind of problem to having low
response rates
• Statistical inferences may extend only to
those who agreed to participate, not to all
asked to participate
• Compare those who agree to refusals
Refusals to Participate (example)
• “Participants were 38 children aged between 7 and 9 years.
Children were from working- or middle-class backgrounds, and
were drawn from 2 primary schools in the north of England.
Letters were sent to the parents of all children between 7 and 9
in both schools seeking consent to participate in the study.
Around 40% of the parents approached agreed for their children
to take part.”
Source: Meins E, Fernyhough C, Johnson F, Lidstone J. (2006) Mind-mindedness in children: Individual differences in internal-state talk in middle childhood. British Journal of Developmental Psychology, 24(1), p. 184.
Attrition
• Individuals who drop out before study’s
end (not an issue for every study design)
• Differences between those who drop out and those who stay in are called attrition bias
• Conduct follow-up study on dropouts
• Compare baseline data
Attrition (example)
• “…Of the 251 men who completed an assigned intervention, about a fifth (19%) failed to return for a 1-month assessment and more than half (54%) for a 3-month assessment… Conclusions also cannot be generalized beyond the sample [partly because] attrition in the evaluation study was relatively high and it was not random. Therefore, findings cannot be generalized to those least likely to complete intervention sessions or follow-up assessments.”
  Source: Williams ML, Bowen AM, Timpson SC, Ross MW, Atkinson JS. (2006) HIV prevention and street-based male sex workers: An evaluation of brief interventions. AIDS Education & Prevention, 18(3), pp. 207-214.
• “The 171 participants who did not return for their two follow-up visits represent a significant attrition rate (34%). A comparison of demographic and baseline measures indicated that [those who stayed in the study versus those who did not] differed on age, BMI, when diagnosed, language, ethnicity, HbA1c, PCS, MCS and symptoms of depression (CES-D).”
  Source: Maljanian R, Grey N, Staff I, Conroy L. (2005) Intensive telephone follow-up to a hospital-based disease management model for patients with diabetes mellitus. Disease Management, 8(1), p. 18.
Back to Inference….
Motivation
• Typically you want to see if there are differences
between groups (i.e., Treatment vs. Control)
• Approach this by looking at “typical” or
“difference on average” between groups
• Thus we look at differences in central tendency
to quantify group differences
• Test whether two sample means differ (assuming the same variance) in an experiment
Same Variance, Different Means
[Figure, repeated: two normal density curves with equal variance σ² and means μ1 < μ2: X ~ N(μ1, σ²) and X ~ N(μ2, σ²).]
Central Limit Theorem
• The CLT states that, regardless of the distribution of the original data, the average of the data is approximately Normally distributed (for large enough samples)
• Why such a big deal?
• Allows for hypothesis testing (p-values)
and CI’s to be estimated
Central Limit Theorem
• If a random sample is drawn from a population,
a statistic (like the sample average) follows a
distribution called a “sampling distribution”.
• CLT tells us the sampling distribution of the
average is a Normal distribution, regardless of
the distribution of the original observations, as
the sample size increases.
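A quick simulation illustrates the CLT for strongly skewed (exponential) data; the sample size of 100 and the 2,000 repetitions are arbitrary choices for the sketch:

```python
import random
import statistics

# Averages of skewed exponential(rate=1) data: the sampling distribution
# of the mean is approximately Normal, centred at the true mean 1 with
# standard deviation close to sigma/sqrt(n) = 1/sqrt(100) = 0.1.
random.seed(3)
n, reps = 100, 2_000
means = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(round(statistics.mean(means), 2))   # close to 1.0
print(round(statistics.stdev(means), 2))  # close to 0.1
```

Plotting a histogram of `means` would show the familiar bell shape even though each individual observation comes from a sharply skewed distribution.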
[Figure: two overlapping normal curves for # of infections, X ~ N(μC, σ²) for control and X ~ N(μT, σ²) for treatment; P-value = 0.164.]
What is the P-value?
• The p-value represents the probability of getting a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true
• That is, the p-value is the chance of obtaining your data results under the assumption that your null hypothesis is true.
• If this probability is low (say p<0.05), then you
conclude your data results do not support the null
being true and “reject the null hypothesis.”
Hypothesis Testing & P-value
• P-value is:
Pr(data results as extreme as observed | null hypothesis is true)
• If P-value is low, then conclude null
hypothesis is not true and reject the null
(“in data we trust”)
• How low is low?
Statistical Significance
If the P-value is as small or smaller than the pre-determined Type I error (size) α, we say that the data are statistically significant at level α.
What value of α is typically assumed?
Probability Distribution & P-value
[Figure: density f(x) of the mean # of infections (axis values 4.4 to 5.6 around a null mean of 5.0); a critical limit separates the “fail to reject H0” region from the one-sided “reject H0” critical region in the upper tail.]
2-sided P-value & Probability Distribution
[Figure: the same density with two critical limits; “reject H0” critical regions in both tails and a “fail to reject H0” region in between (axis values 4.4 to 5.6, mean # of infections).]
Why P-value < 0.05 ?
This arbitrary cutoff has evolved over time largely by precedent.
In legal matters, courts typically require
statistical significance at the 5% level.
The P-value
The P-value provides a continuum of evidence against the null hypothesis.
Not just a dichotomous indicator of
significance.
Would you change your standard of care
surgery procedure for p=0.049999 vs.
p=0.050001?
Gehlbach’s beefs with P-value
• Size of P-value does not indicate the
[clinical] importance of the result
• Results may be statistically significant but
practically unimportant
• Differences not statistically significant are
not necessarily unimportant ***
• Any difference can become statistically
significant if N is large enough
• Even if there is statistical significance, is there clinical significance?
Controversy around
HT and P-value
“A methodological culprit responsible for spurious
theoretical conclusions”
(Meehl, 1967; see Greenwald et al, 1996)
“The p-value is a measure of the credibility of the null
hypothesis. The smaller the P-value is, the less
likely one feels the null hypothesis can be true.”
HT and p-value
• “It cannot be denied that many journal editors
and investigators use P-value < 0.05 as a
yardstick for the publishability of a result.”
• “This is unfortunate because not only P-value,
but also the sample size and magnitude of a
physically important difference determine the
quality of an experimental finding.”
HT and p-value
• “[We] endorse the reporting of estimation
statistics (such as effect sizes, variabilities,
and confidence intervals) for all important
hypothesis tests.”
– Greenwald et al (1996)
Test Statistics
• Each hypothesis test has an associated
test statistic.
• A test statistic measures compatibility
between the null hypothesis and the data.
• A test statistic is a random variable with a
certain distribution.
• A test statistic is used to calculate
probability (P-value) for the test of
significance.
How a P-value is calculated
• A data summary statistic is estimated (like the sample mean)
• A “test” statistic is calculated which relates the data summary
statistic to the null hypothesis about the population parameter
(the population mean)
• The observed/calculated test statistic is compared to what is
expected under the null hypothesis using the Sampling
Distribution of the test statistic
• The Probability of finding the observed test statistic (or more
extreme) is calculated (this is the P-value)
Hypothesis Testing
1. Set up a null and alternative hypothesis
2. Calculate test statistic
3. Calculate the P-value for the test
statistic
4. Based on P-value make a decision to
reject or fail to reject the null hypothesis
5. Make your conclusion
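Steps 2–4 above can be sketched for a one-sample test of a mean. The infection-count data and the null value of 5.0 are hypothetical, and the normal approximation (via `math.erf`) stands in for the exact t distribution:

```python
import math
import statistics

def two_sided_p_value(sample, mu0):
    """Steps 2-3: the one-sample test statistic and its two-sided P-value,
    using the normal approximation for the sampling distribution."""
    n = len(sample)
    t = (statistics.mean(sample) - mu0) / (statistics.stdev(sample) / math.sqrt(n))
    # Standard normal CDF built from math.erf; P-value = 2 * P(Z >= |t|).
    phi = lambda z: 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return 2 * (1 - phi(abs(t)))

# Hypothetical infection counts tested against a null mean of 5.0:
data = [4.4, 4.1, 4.3, 4.6, 4.0, 4.2, 4.5, 3.9, 4.4, 4.1]
p = two_sided_p_value(data, mu0=5.0)
print(p < 0.05)  # step 4: reject H0 at the 5% level if True
```

For small samples a t distribution with n − 1 degrees of freedom would be used in place of the normal CDF; the structure of the calculation is the same.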
Errors in
Statistical Inference
The Four Possible Outcomes in Hypothesis Testing

Decision based on Data | Truth in Population: H0 true    | Truth in Population: H0 false
Fail to reject H0      | H0 is true & H0 is not rejected | H0 is false & H0 is not rejected
Reject H0              | H0 is true & H0 is rejected     | H0 is false & H0 is rejected

Note similarities to diagnostic tests!
The Four Possible Outcomes in Hypothesis Testing

DECISION          | TRUTH: H0 true   | TRUTH: H0 false
Fail to reject H0 | Correct decision | Type II error (β)
Reject H0         | Type I error (α) | Power (1 − β)

Conditioned on column!
Type I Errors
α = Pr(Type I error)
  = Pr(reject H0 | H0 is true)
“Innocent until proven guilty”: innocence is rejected but the defendant is truly innocent (concluded guilty).
Type II Errors
β = Pr(Type II error)
  = Pr(do not reject H0 | H0 is false)
“Innocent until proven guilty”: innocence is not rejected but the defendant was truly guilty (concluded innocent).
P-value
adjustments
P-value adjustments
• Sometimes adjustments for multiple testing are
made
• Bonferroni: adjusted α = (overall α) / (# of tests)
• Overall α is usually 0.05 (P-value cutoff)
• Bonferroni is a common (but conservative)
adjustment; many others exist
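The Bonferroni adjustment is a one-line computation; here it is sketched for five tests at an overall α of 0.05:

```python
def bonferroni_alpha(alpha, n_tests):
    """Bonferroni-adjusted significance cutoff: alpha / (# of tests)."""
    return alpha / n_tests

# Five tests at an overall alpha of 0.05:
print(round(bonferroni_alpha(0.05, 5), 3))  # 0.01
```

Each individual P-value is then compared against the adjusted cutoff, which bounds the chance of any Type I error across all tests at roughly the overall α.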
P-value adjustments (example)
• “An alpha of .05 was used for all statistical tests. The
Bonferroni correction was used, however, to reduce
the chance of committing a Type I error. Therefore,
given that five statistical tests were conducted, the
adjusted alpha used to reject the null hypothesis was
.05/5 or alpha = .01.”
Source: Cumming-McCann A. (2005) An investigation
of rehabilitation counselor characteristics, white racial
attitudes, and self-reported multicultural counseling
competencies. Rehabilitation Counseling Bulletin,
48(3), 170-171.
Confidence
Intervals
(CI’s)
Confidence Intervals
• What is the idea of a confidence interval? Calculate a range of reasonable values (an interval around the point estimate) that should include the true population value 95% of the time if you were to collect sample data over and over again.
Confidence Intervals
95% Confidence
[Figure: an interval estimate ( —— ) drawn around a vertical line at the true population mean μ.]
In other words, if 100 different samples were drawn from the same population and 100 intervals were calculated, approximately 95 of them would contain the population mean.
Confidence Intervals
• 100·(1−α)% Confidence Interval for the Mean:
  X̄ ± t(df = n−1, 1−α/2) · sd/√n
• 100·(1−α)% Confidence Interval for a Proportion:
  p̂ ± z(1−α/2) · √( p̂(1−p̂)/n )
95% Confidence Intervals
• 95% Confidence Interval for the Mean:
  X̄ ± 2 · sd/√n
• 95% Confidence Interval for a Proportion:
  p̂ ± 2 · √( p̂(1−p̂)/n )
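Both approximate 95% intervals (using 2 in place of 1.96) can be sketched in a few lines; the example counts of 30 successes in 100 trials are hypothetical:

```python
import math
import statistics

def mean_ci_95(sample):
    """Approximate 95% CI for a mean: x-bar +/- 2 * sd / sqrt(n)."""
    xbar = statistics.mean(sample)
    half = 2 * statistics.stdev(sample) / math.sqrt(len(sample))
    return xbar - half, xbar + half

def proportion_ci_95(successes, n):
    """Approximate 95% CI for a proportion:
    p-hat +/- 2 * sqrt(p-hat * (1 - p-hat) / n)."""
    p_hat = successes / n
    half = 2 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half, p_hat + half

# Hypothetical example: 30 infections observed among 100 patients.
lo, hi = proportion_ci_95(30, 100)
print(round(lo, 3), round(hi, 3))  # 0.208 0.392
```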
Bayesian vs. Classical Inference
• There are 2 main camps of Statistical Inference:
– Frequentist (classical) statistical inference
– Bayesian statistical inference
• Bayesian inference incorporates “past knowledge” about
the probability of events using “prior probabilities”
• Bayesian paradigm assumes parameters of interest
follow a statistical distribution of their own; Frequentist
inference assumes parameters are fixed
• Statistical inference is then performed to ascertain the “posterior probability” of outcomes, which depends on:
  – the data
  – the assumed prior probabilities
Schedule

Seminar # | Topic                                                    | Date  | Time
1         | Study design and data collection                         | 9/10  | 1:30 – 3:00
2         | Probability and statistical inference                    | 9/17  | 2:00 – 4:00
3         | Data summary measures and graphical display of results*  | 10/1  | 2:00 – 4:00
4         | Survey of statistical analysis techniques (part I)       | 10/8  | 2:00 – 4:00
5         | Survey of statistical analysis techniques (part II)      | 10/15 | 2:00 – 4:00
6         | Evidence-based medicine and decision analysis            | 11/5  | 2:00 – 4:00
7         | Reading and reviewing analyses in medical literature*    | 11/19 | 2:00 – 4:00
8         | Review of student-selected medical publications*         | 12/3  | 2:00 – 4:00

*10/01 seminar will meet in Wachovia 2314