Advanced Statistics for Interventional Cardiologists
What you will learn
• Introduction
• Basics of multivariable statistical modeling
• Advanced linear regression methods
• Hands-on session: linear regression
• Bayesian methods
• Logistic regression and generalized linear model
• Resampling methods
• Meta-analysis
• Hands-on session: logistic regression and meta-analysis
• Multifactor analysis of variance
• Cox proportional hazards analysis
• Hands-on session: Cox proportional hazard analysis
• Propensity analysis
• Most popular statistical packages
• Conclusions and take home messages
Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods
Major breakthroughs since the 70’s: in cardiology, and in statistics*
*source: American Statistical Association
Samples and populations
This is a sample, and this is its universal population.
This is another sample, and this might be its universal population.
But what if THIS is its universal population?
[Figure: samples and candidate universal populations]
Samples and populations
Resampling is based on the use of repeated samples from the original one to make inferences on the target population with few assumptions.
Samples and populations
1
1
3
5
3
1
2
3
4
1
2
3
4
5
Resampling is based on the use of
repeat samples from the original one
to make inferences on the target
population with few assumptions
Samples and populations
1
1
1
3
4
2
3
5
1
2
3
4
1
3
4
5
2
2
4
Resampling is based on the use of
repeat samples from the original one
to make inferences on the target
population with few assumptions
1
2
3
5
Samples and populations
1
1
1
2
3
5
4
3
1
2
3
4
1
3
4
5
2
2
4
3
3
4
5
5
Resampling is based on the use of
repeat samples from the original one
to make inferences on the target
population with few assumptions
1
2
3
5
1
2
4
5
Samples and populations
1
1
1
2
3
5
4
3
1
2
3
4
1
3
4
5
2
2
4
3
3
2
5
5
4
5
3
5
5
Resampling is based on the use of
repeat samples from the original one
to make inferences on the target
population with few assumptions
1
2
3
5
1
4
2
5
1
3
4
5
Samples and populations
1
2
2
3
3 replacement
3
1 Random resampling with
2
5
4
3
4
4
5
5
5
5
5
3
1
1
2
3
4
1
2
3
4
5
Resampling is based on the use of
repeat samples from the original one
to make inferences on the target
population with few assumptions
1
1
2
1
2
Random resampling
without replacement 3
3
4
5
4
5
5
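As a concrete illustration (my addition, not from the slides), here is a minimal Python sketch of the two schemes in the figure, using only the standard library; the sample {1, 2, 3, 4, 5} mirrors the one shown above.

```python
import random

random.seed(42)  # fix the seed so the runs are reproducible

original_sample = [1, 2, 3, 4, 5]

# Random resampling WITH replacement: the same item may be drawn
# repeatedly, so a resample can contain duplicates (e.g. [5, 1, 5, 3, 3]).
with_replacement = [random.choices(original_sample, k=len(original_sample))
                    for _ in range(3)]

# Random resampling WITHOUT replacement: each item is drawn at most once,
# so every resample is a permutation of the original sample.
without_replacement = [random.sample(original_sample, k=len(original_sample))
                       for _ in range(3)]

print("with replacement:   ", with_replacement)
print("without replacement:", without_replacement)
```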
Resampling
• Resampling refers to the use of the observed
data or of a data generating mechanism (such
as a die or computer-based simulation) to
produce new hypothetical samples, the
results of which can then be analyzed.
• The term computer-intensive methods is also frequently used to refer to techniques such as these…
Resampling – the hot issue
• Validity of resampling processes depends on sample size and selection:
– each individual (or item) in the population must have an equal probability of being selected
– no individual (item) or class of individuals may be discriminated against
(Pseudo-) random numbers
• Items selected during resampling procedures
are often identified by relying on random or
pseudo-random numbers.
• Pseudo-random numbers are apparently random numbers generated by deterministic algorithms; the same sequence can be generated repeatedly by using the same procedure and settings. Pseudo-random numbers are used very often in statistics.
(Pseudo-) random numbers
Desirable properties for random number generators are:
• Randomness: provide a sequence of uniformly distributed
random numbers.
• Long period: the generator should be able to produce, without repeating the initial sequence, all of the random numbers of the huge samples that current computer speeds allow for.
• Computational efficiency: the execution should be rapid.
• Repeatability: initial conditions (seed values) completely
determine the resulting sequence of random variables.
• Portability: identical sequences of random variables may be
produced on a wide variety of computers (for given seed values).
• Homogeneity: all subsets of bits of the numbers are random,
from the most- to the least-significant bits.
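A minimal sketch of the repeatability property, assuming Python’s standard pseudo-random generator (the Mersenne Twister): the same seed value always reproduces the same sequence.

```python
import random

def first_draws(seed, n=5):
    """Return the first n pseudo-random numbers produced from a given seed."""
    rng = random.Random(seed)  # independent generator instance
    return [round(rng.random(), 4) for _ in range(n)]

# Repeatability: identical seeds yield identical sequences...
assert first_draws(seed=2024) == first_draws(seed=2024)

# ...while different seeds yield (with overwhelming probability) different ones.
print(first_draws(seed=2024))
print(first_draws(seed=2025))
```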
Cross-validation
• The first and simplest form of resampling is cross-validation, eg splitting the original sample into two halves which are analyzed separately.
• More formally, it consists in partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.
• The initial subset of data is called the training or derivation set.
• The other subset(s) are called validation or testing sets.
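A minimal split-half cross-validation sketch on hypothetical data (both the data and the trivial “model” below are my illustrative assumptions): an estimate is derived on the training half and checked on the held-out half.

```python
import random
import statistics

random.seed(1)

# Hypothetical data: 20 measurements of some clinical variable.
data = [random.gauss(mu=120, sigma=15) for _ in range(20)]

# Split the sample at random into a training (derivation) half
# and a validation (testing) half.
shuffled = random.sample(data, k=len(data))
train, test = shuffled[:10], shuffled[10:]

# Fit on the training set (here the "model" is just the sample mean)...
model_estimate = statistics.mean(train)

# ...and validate it on the held-out half.
validation_error = statistics.mean(test) - model_estimate
print(f"training estimate: {model_estimate:.1f}")
print(f"validation error:  {validation_error:+.1f}")
```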
Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods
Bootstrap
• The bootstrap is a modern, computer-intensive, general-purpose approach to statistical inference, falling within a broader class of resampling methods.
• Bootstrapping is the practice of estimating properties
of an estimator (such as its variance) by measuring
those properties when sampling from an
approximating distribution. One standard choice for
an approximating distribution is the empirical
distribution of the observed data.
Bootstrap
• In the case where a set of observations can be
assumed to be from an independent and identically
distributed population, this can be implemented by
constructing a number of resamples of the observed
dataset (and of equal size to the observed dataset),
each of which is obtained by random sampling with
replacement from the original dataset.
• It may also be used for constructing hypothesis tests.
It is often used as an alternative to inference based
on parametric assumptions when those assumptions
are in doubt, or where parametric inference is
impossible or requires very complicated formulas for
the calculation of standard errors.
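A minimal non-parametric bootstrap sketch under the i.i.d. assumption just described: B resamples drawn with replacement, each of the same size as the observed dataset, used to estimate the standard error of the median. The dataset is hypothetical and purely illustrative.

```python
import random
import statistics

random.seed(7)

# Hypothetical observed dataset (n = 15).
data = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 7.1, 4.4,
        5.8, 6.0, 3.9, 5.2, 4.7, 6.5, 5.1]

B = 200  # around 200 resamples is the usual suggestion for standard errors

# Each bootstrap resample is drawn with replacement and has the same
# size as the original dataset.
boot_medians = [
    statistics.median(random.choices(data, k=len(data)))
    for _ in range(B)
]

# The bootstrap standard error is the standard deviation of the
# statistic across the resamples.
se_median = statistics.stdev(boot_medians)
print(f"observed median:        {statistics.median(data):.2f}")
print(f"bootstrap SE of median: {se_median:.2f}")
```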
Bootstrap
• The advantage of bootstrapping over analytical methods is its great simplicity:
– it is straightforward to apply the bootstrap to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.
• The disadvantages of bootstrapping are:
– while (under some conditions) it is asymptotically consistent, it does not provide general finite-sample guarantees, and has a tendency to be overly optimistic;
– the apparent simplicity may conceal the fact that important assumptions are being made when undertaking the bootstrap analysis (e.g. independence of samples) where these would be more formally stated in other approaches.
Efron’s explanations
Bootstrap: how many samples
Increasing the number of bootstrap samples is like increasing the resolution of an image.
• For standard deviation/error, most practitioners suggest that the number of bootstrap samples (B) should be around 200.
• For 95% confidence intervals, B should be between 1000 and 2000.
• Further, estimating a confidence interval usually requires estimating the 100α percentile of the bootstrap distribution. To do this, the bootstrap sample is first sorted into ascending order. Then, if α(B+1) is an integer, the percentile is estimated by the α(B+1)th member of the ordered bootstrap sample. Otherwise, interpolation must be used, between the [α(B+1)]th and ([α(B+1)]+1)th members of the ordered sample, where [ ] denotes the integer part. Consequently, choosing B=999 or B=1999 leads to simple calculations for the common choices of α.
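A sketch of the percentile rule just described, assuming B = 999 hypothetical bootstrap values: with α = 0.025 and α = 0.975, α(B+1) equals 25 and 975 exactly, so no interpolation is needed.

```python
import math
import random

random.seed(3)

# Hypothetical ordered bootstrap sample of B = 999 statistic values.
B = 999
boot = sorted(random.gauss(mu=10, sigma=2) for _ in range(B))

def bootstrap_percentile(ordered_boot, alpha):
    """100*alpha percentile of an ordered bootstrap sample via the
    alpha*(B+1) rule, with linear interpolation when needed."""
    B = len(ordered_boot)
    pos = alpha * (B + 1)      # target position, 1-based
    k = math.floor(pos)        # [alpha*(B+1)] = integer part
    if k == pos:               # integer position: take that member directly
        return ordered_boot[k - 1]
    frac = pos - k             # otherwise interpolate between the
    return (1 - frac) * ordered_boot[k - 1] + frac * ordered_boot[k]

# With B = 999, alpha*(B+1) = 25 and 975 exactly for a 95% interval.
lo = bootstrap_percentile(boot, 0.025)
hi = bootstrap_percentile(boot, 0.975)
print(f"95% percentile CI: ({lo:.2f}, {hi:.2f})")
```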
Bootstrap: which type of resampling?
• Bootstrap with non-parametric
resampling: makes no assumptions concerning
the distribution of, or model for, the data.
• Bootstrap with parametric resampling:
we assume that a parametric model for the data is
known up to the unknown parameter vector.
• Bootstrap with semi-parametric
resampling: a variant of parametric resampling,
appropriate for some forms of regression.
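For contrast with the non-parametric sketch earlier, a minimal parametric-resampling sketch: assuming (hypothetically) that the data come from a normal model, new samples are simulated from the fitted model rather than drawn from the observed values.

```python
import random
import statistics

random.seed(9)

# Hypothetical observed data, assumed to follow a normal model.
data = [4.1, 5.0, 3.8, 6.2, 5.5, 4.9, 7.1, 4.4, 5.8, 6.0]

# Fit the parametric model: estimate the unknown parameter vector (mu, sigma).
mu_hat = statistics.mean(data)
sd_hat = statistics.stdev(data)

# Parametric resampling: simulate new datasets from the FITTED model,
# instead of resampling the observed values with replacement.
B = 1000
boot_means = [
    statistics.mean(random.gauss(mu_hat, sd_hat) for _ in range(len(data)))
    for _ in range(B)
]

print(f"parametric bootstrap SE of the mean: {statistics.stdev(boot_means):.3f}")
```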
Bootstrap: which type of resampling?
Then, which one to choose?
• Parametric and non-parametric simulation make very different assumptions. The general principle is that the simulation process should mirror as closely as possible the process that gave rise to the observed data. Thus, if we believe a particular model (ie we believe that the fitted model differs from the true model only because the true values of the parameters have been replaced by estimates obtained from the data), then the parametric (or, in regression, preferably semi-parametric) bootstrap is appropriate.
• However, examination of the residuals may cast doubt on the modelling assumptions. In this case, non-parametric simulation is often appropriate. It is interesting to note that, in practice, non-parametric simulation gives results that generally mimic the results obtained under the best fitting, not the simplest, parametric model.
Bootstrap and beyond
Example: bootstrapping a
logistic regression analysis
Predictors selected in 25 bootstrap replications for the cohort study
(n=155; 33 with events), based on forward stepwise logistic regression.
The predictors selected by the actual data were variables 13, 15, 7, 20.
15 16 3
15 20 4
16 13 2 19
18 20 3
13 15 20
15 13
15 20 7
13
15
13 14
12 20 18
2 20 15 7 19 12
13 20 15 19
13 7 20 15
13 19 6
20 16 19
20 19
14 18 7 16 2
18 20 7 11
20 19 15
20
13 12 15 8 18 7 19
15 13 19
13 4
12 15 3
A pivotal work on bootstrap
Bootstrap for internal validation
Internal validation of predictors generated by multivariable logistic regression analysis was performed by means of bootstrapping techniques, with 1000 cycles and generation of OR and bias-corrected 95% CI.14
14. Steyerberg EW, Harrell FE Jr, Borsboom GJJM, Eijkemans MJCR, Vergouwe Y, Habbema JDF. Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774–781.
Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods
Jackknife
• Jackknifing is a resampling method based on the creation of several subsamples by excluding a single case at a time.
• Thus, there are only N jackknife samples for any given original sample with N cases.
• After the systematic recomputation of the statistic of choice is completed, a point estimate and an estimate of the variance of the statistic can be calculated.
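A minimal jackknife sketch for the mean of a hypothetical sample: N leave-one-out subsamples, combined with the standard jackknife variance estimate var = (N−1)/N × Σ(θᵢ − θ̄)², which the slide does not spell out.

```python
import statistics

# Hypothetical observed sample (N = 8).
data = [2.3, 3.1, 2.8, 4.0, 3.6, 2.9, 3.3, 3.8]
N = len(data)

# The N jackknife subsamples: each one omits a single case.
jack_estimates = [
    statistics.mean(data[:i] + data[i + 1:])  # mean without case i
    for i in range(N)
]

# Jackknife point estimate and variance estimate for the mean.
jack_mean = statistics.mean(jack_estimates)
jack_var = (N - 1) / N * sum((e - jack_mean) ** 2 for e in jack_estimates)

print(f"point estimate:     {jack_mean:.3f}")
print(f"jackknife variance: {jack_var:.4f}")
print(f"jackknife SE:       {jack_var ** 0.5:.4f}")
```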
Bootstrap vs jackknife: the winner is…
The jackknife is an acceptable approximation of the bootstrap only for linear statistics.
Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods
Do you like Monte Carlo?
Monte Carlo resampling
• Monte Carlo methods are a class of computational
algorithms that rely on repeated random sampling to
compute their results. They provide approximate solutions to
mathematical problems by sampling experiments.
• Because of their reliance on repeated computation and
random or pseudo-random numbers, Monte Carlo methods
are most suited to calculation by a computer, being
chosen when it is infeasible or impossible to compute an
exact result with a deterministic algorithm.
• These simulation methods are especially useful in studying
systems with a large number of coupled degrees of
freedom, eg fluids, social behaviors, hierarchical models, ….
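A classic toy example of the idea above (my addition, not from the slides): approximating π by repeated random sampling, exactly the kind of “sampling experiment” a deterministic formula would otherwise solve.

```python
import random

random.seed(0)

def estimate_pi(n_points):
    """Monte Carlo estimate of pi: the fraction of random points in the
    unit square that fall inside the quarter circle approaches pi/4."""
    inside = 0
    for _ in range(n_points):
        x, y = random.random(), random.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4 * inside / n_points

# The approximation improves with the number of sampling experiments.
for n in (100, 10_000, 1_000_000):
    print(f"n = {n:>9}: pi ~ {estimate_pi(n):.4f}")
```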
Monte Carlo resampling
• More broadly, Monte Carlo methods are useful for
modeling phenomena with significant uncertainty in
inputs (eg Bayesian statistical analyses).
• The term Monte Carlo was coined in the 1940s by physicists working on nuclear weapons projects at Los Alamos (Ulam, Fermi, von Neumann, and Metropolis). The name is a reference to the Monte Carlo Casino in Monaco, where Ulam's uncle would borrow money to gamble.
• The use of randomness and the repetitive nature of the
process are analogous to the activities conducted at a
casino.
Impact of Monte Carlo methods
• Monte Carlo simulation methods do not always require truly random numbers to be useful, although for some applications unpredictability is vital.
• Many of the most useful techniques use deterministic, pseudo-random number sequences, making it easy to test and re-run simulations.
• Monte Carlo resampling simulation takes the mumbo-jumbo out of statistics and enables even beginning students to understand completely everything that is done.
• The application of Monte Carlo methods in teaching statistics is also not new.
Impact of Monte Carlo methods
• What is new and radical is using Monte Carlo methods routinely as problem-solving tools for everyday problems in probability and statistics.
• Computationally intensive, but conceptually simple, methods belong at the forefront, whereas traditional analytical simplifications lose importance.
• Monte Carlo simulations are not only relevant for simulating models of interest, but also constitute a valuable tool for approaching statistics.
Markov chain
• In mathematics, a Markov chain, named after Andrey Markov, is a discrete-time stochastic process with the Markov property.
• Having the Markov property means that, given the present state of the process, knowledge of the previous states is irrelevant for predicting the probability of subsequent states.
• In this way a Markov chain is "memoryless": conditional on the present state, no earlier state adds any information about future states.
• At each point in time the system may have changed state from the state it was in the moment before, or it may have stayed in the same state. The changes of state are called transitions. If a sequence of states has the Markov property, then every future state is conditionally independent of every prior state, given the present one.
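A minimal two-state Markov chain sketch with hypothetical transition probabilities: the next state depends only on the current one, never on the earlier history.

```python
import random

random.seed(11)

# Hypothetical transition probabilities: P[current][next], rows sum to 1.
P = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

def step(state):
    """One transition: the next state depends only on the current state
    (the Markov property), not on how the chain got here."""
    nxt, probs = zip(*P[state].items())
    return random.choices(nxt, weights=probs)[0]

# Simulate the chain; the long-run fraction of time in each state
# approaches the stationary distribution (here 2/3 sunny, 1/3 rainy).
state, visits = "sunny", {"sunny": 0, "rainy": 0}
for _ in range(100_000):
    state = step(state)
    visits[state] += 1

print({s: round(v / 100_000, 3) for s, v in visits.items()})
```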
Markov chain
• The PageRank of a webpage as used by Google is
defined by a Markov chain.
• Markov chain methods have also become very
important for generating sequences of random
numbers to accurately reflect very complicated
desired probability distributions, via a process
called Markov chain Monte Carlo (MCMC).
• In recent years this has revolutionised the
practicability of Bayesian inference methods,
allowing a wide range of posterior distributions to
be simulated and their parameters found
numerically.
Markov chain Monte Carlo (MCMC)
• Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions based on constructing a Markov chain that has the desired distribution as its stationary distribution.
• The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.
• Usually it is not hard to construct an MCMC algorithm with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error.
Metropolis-Hastings algorithm
• The Metropolis-Hastings algorithm is a method for creating a Markov chain that can be used to generate a sequence of samples from a probability distribution that is difficult to sample from directly.
• This sequence can be used in MCMC simulation to approximate the distribution (i.e., to generate a histogram), or to compute an integral (such as an expected value).
• The Gibbs sampling algorithm is a special case of the Metropolis-Hastings algorithm which is usually faster and easier to use but less generally applicable in physics (although it is actually the most common in Bayesian analysis).
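A minimal random-walk Metropolis-Hastings sketch targeting a standard normal density: a deliberately simple target of my choosing, whereas real applications involve distributions that cannot be sampled directly.

```python
import math
import random

random.seed(42)

def target_density(x):
    """Unnormalised target: a standard normal. In practice this would be
    a posterior known only up to a normalising constant."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_samples, proposal_sd=1.0):
    samples, x = [], 0.0  # start the chain at an arbitrary point
    for _ in range(n_samples):
        x_new = random.gauss(x, proposal_sd)  # symmetric random-walk proposal
        # Acceptance probability; the normalising constant cancels out.
        accept_prob = min(1.0, target_density(x_new) / target_density(x))
        if random.random() < accept_prob:
            x = x_new                         # accept the move
        samples.append(x)                     # (else keep the current state)
    return samples

draws = metropolis_hastings(50_000)
burned = draws[5_000:]  # discard burn-in before convergence
mean = sum(burned) / len(burned)
var = sum((d - mean) ** 2 for d in burned) / len(burned)
print(f"sample mean ~ {mean:.2f} (target 0), variance ~ {var:.2f} (target 1)")
```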
Gibbs sampler
• Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables.
• The purpose of such a sequence is to approximate the joint distribution, or to compute an integral (such as an expected value).
• Gibbs sampling is a special case of the Metropolis-Hastings algorithm, and thus an example of a Markov chain Monte Carlo (MCMC) algorithm. More specifically, it is a form of MCMC in which the value Xn at each successive site n is updated using the full conditional distribution of Xn given the values Xm at all other sites m ≠ n. It is also known as the heat bath sampler. Successive sites n may be chosen systematically or randomly.
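A minimal Gibbs sampler sketch for a standard bivariate normal with correlation ρ, a textbook example of my choosing (not from the slides): each variable is updated in turn from its full conditional given the other.

```python
import math
import random

random.seed(5)

rho = 0.8  # hypothetical correlation of a standard bivariate normal

def gibbs(n_samples):
    """For a standard bivariate normal, the full conditionals are
    X | Y=y ~ N(rho*y, 1-rho^2) and Y | X=x ~ N(rho*x, 1-rho^2)."""
    x = y = 0.0
    cond_sd = math.sqrt(1 - rho * rho)
    out = []
    for _ in range(n_samples):
        x = random.gauss(rho * y, cond_sd)  # update x from p(x | y)
        y = random.gauss(rho * x, cond_sd)  # update y from p(y | x)
        out.append((x, y))
    return out

samples = gibbs(50_000)[1_000:]  # drop a short burn-in
n = len(samples)
mx = sum(x for x, _ in samples) / n
my = sum(y for _, y in samples) / n
cov = sum((x - mx) * (y - my) for x, y in samples) / n  # ~ rho here
print(f"estimated correlation ~ {cov:.2f} (target {rho})")
```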
One of my favourite articles
A pivotal work based on Monte Carlo Methods
Concato et al, JCE 1995
Software
• BoxSampler (add-in for Excel)
• C
• DDXL
• R
• Fortran
• Lisp
• Pascal
• Resampling Stats (add-in for Excel)
• RiskAMP
• S
• S-Plus
• SAS
• SimulAr
• StatXact
Questions?
Take home messages
• Resampling methods are being increasingly used, given the major breakthroughs in computational power.
• Among cross-validation, jackknife, and bootstrap, the latter is the most powerful and robust tool for statistical inference and validation.
• Monte Carlo simulation methods have become more and more common for estimating the bias of different statistical procedures and for fitting complex Bayesian models.
• In all but very exceptional cases, resampling methods are best left in the hands of a statistical professional, working together with the clinician in order to achieve the clinician's goal while safeguarding validity.
And now a brief break…
For further slides on these topics
please feel free to visit the
metcardio.org website:
http://www.metcardio.org/slides.html