Advanced Statistics for Interventional Cardiologists

What you will learn
• Introduction
• Basics of multivariable statistical modeling
• Advanced linear regression methods
• Hands-on session: linear regression
• Bayesian methods
• Logistic regression and generalized linear models
• Resampling methods
• Meta-analysis
• Hands-on session: logistic regression and meta-analysis
• Multifactor analysis of variance
• Cox proportional hazards analysis
• Hands-on session: Cox proportional hazards analysis
• Propensity analysis
• Most popular statistical packages
• Conclusions and take home messages
(topics split across the 1st and 2nd day)

Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods

Major breakthroughs since the 70's: in cardiology and in statistics*
*source: American Statistical Association

Samples and populations
[Figure sequence]
• This is a sample… and this is its universal population.
• This is another sample… and this might be its universal population.
• But what if THIS is its universal population?
Samples and populations
Resampling is based on the use of repeat samples from the original one to make inferences on the target population, with few assumptions.
[Figure sequence: the sample {1, 2, 3, 4, 5} resampled repeatedly, contrasting random resampling with replacement (values may recur within a resample) and random resampling without replacement (each value drawn at most once)]

Resampling
• Resampling refers to the use of the observed data, or of a data-generating mechanism (such as a die or a computer-based simulation), to produce new hypothetical samples, the results of which can then be analyzed.
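The with/without-replacement distinction shown in the figure above can be sketched in a few lines of Python; the sample {1, 2, 3, 4, 5} and the seed are arbitrary choices for illustration:

```python
import random

random.seed(42)  # arbitrary seed, fixed only so the run is repeatable

sample = [1, 2, 3, 4, 5]

# Random resampling WITH replacement: the same value may recur
with_replacement = random.choices(sample, k=5)

# Random resampling WITHOUT replacement: a reordering of the sample
without_replacement = random.sample(sample, k=5)

print(with_replacement)     # may contain duplicates
print(without_replacement)  # always the five original values, reordered
```

Sampling with replacement is the mechanism underlying the bootstrap discussed below; sampling without replacement underlies permutation-style procedures.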
• The term computer-intensive methods is also frequently used to refer to such techniques.

Resampling – the hot issue
• Validity of resampling processes depends on sample size and selection:
– each individual (or item) in the population must have an equal probability of being selected
– no individual (item) or class of individuals may be discriminated against

(Pseudo-)random numbers
• Items selected during resampling procedures are often identified by relying on random or pseudo-random numbers.
• Pseudo-random numbers are apparently random numbers generated by specific algorithms; the same sequence can be generated repeatedly using the same procedure and settings. Pseudo-random numbers are used very often in statistics.

(Pseudo-)random numbers
Desirable properties for random number generators are:
• Randomness: the generator provides a sequence of uniformly distributed random numbers.
• Long period: the generator can produce, without repeating the initial sequence, all the random variables of as huge a sample as current computer speeds allow.
• Computational efficiency: execution should be rapid.
• Repeatability: initial conditions (seed values) completely determine the resulting sequence of random variables.
• Portability: identical sequences of random variables can be produced on a wide variety of computers (for given seed values).
• Homogeneity: all subsets of bits of the numbers are random, from the most to the least significant bits.

Cross-validation
• The first and simplest form of resampling is cross-validation, e.g. splitting the original sample into two halves which are analyzed separately.
• More formally, it consists of partitioning a sample of data into subsets such that the analysis is initially performed on a single subset, while the other subset(s) are retained for subsequent use in confirming and validating the initial analysis.
• The initial subset of data is called the training or derivation set.
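The two-halves split just described can be sketched as follows; the toy dataset of 20 observations and the fixed seed are invented for the example (note how the seed makes the partition repeatable, one of the desirable generator properties listed above):

```python
import random

random.seed(1)  # fixed seed: the same split is reproduced on every run

data = list(range(20))   # toy dataset of 20 observations
random.shuffle(data)     # random partition of the original sample

half = len(data) // 2
training_set = data[:half]    # derivation set: used to fit the model
validation_set = data[half:]  # testing set: used to confirm the analysis

print(len(training_set), len(validation_set))  # prints: 10 10
```

Each observation lands in exactly one of the two subsets, so the validation half never informs the initial analysis.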
• The other subset(s) are called validation or testing sets.

Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods

Bootstrap
• The bootstrap is a modern, computer-intensive, general-purpose approach to statistical inference, falling within a broader class of resampling methods.
• Bootstrapping is the practice of estimating properties of an estimator (such as its variance) by measuring those properties when sampling from an approximating distribution. One standard choice for an approximating distribution is the empirical distribution of the observed data.
• In the case where a set of observations can be assumed to come from an independent and identically distributed population, this can be implemented by constructing a number of resamples of the observed dataset (each of equal size to the observed dataset), each obtained by random sampling with replacement from the original dataset.
• It may also be used for constructing hypothesis tests. It is often used as an alternative to inference based on parametric assumptions when those assumptions are in doubt, or where parametric inference is impossible or requires very complicated formulas for the calculation of standard errors.
• The advantage of bootstrapping over analytical methods is its great simplicity: it is straightforward to apply the bootstrap to derive estimates of standard errors and confidence intervals for complex estimators of complex parameters of the distribution, such as percentile points, proportions, odds ratios, and correlation coefficients.
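A minimal non-parametric bootstrap sketch of this idea, taking the sample mean as the statistic of interest; the data are made up for illustration, and B = 1999 is used so that the α(B+1) percentile positions come out as integers:

```python
import random
import statistics

random.seed(7)  # arbitrary seed for repeatability

data = [2.1, 3.4, 2.8, 4.0, 3.1, 2.5, 3.9, 3.3, 2.7, 3.6]  # made-up sample
B = 1999  # number of bootstrap samples

# Resample the observed data WITH replacement, same size as the original,
# and recompute the statistic on each resample
boot_means = []
for _ in range(B):
    resample = random.choices(data, k=len(data))
    boot_means.append(statistics.mean(resample))

se = statistics.stdev(boot_means)  # bootstrap standard error of the mean

# Percentile 95% CI: sort ascending, take the alpha*(B+1)-th ordered values
boot_means.sort()
lo_idx = round(0.025 * (B + 1))  # = 50: an integer, so no interpolation
hi_idx = round(0.975 * (B + 1))  # = 1950
lower = boot_means[lo_idx - 1]   # 50th ordered value (1-indexed rule)
upper = boot_means[hi_idx - 1]   # 1950th ordered value

print(f"SE ~ {se:.3f}, 95% CI ~ ({lower:.2f}, {upper:.2f})")
```

No formula for the standard error of the mean was needed: the spread of the recomputed statistic across resamples does the work, which is exactly what makes the method attractive for complex estimators.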
• The disadvantage of bootstrapping is that, while it is (under some conditions) asymptotically consistent, it does not provide general finite-sample guarantees and has a tendency to be overly optimistic; moreover, its apparent simplicity may conceal the fact that important assumptions are being made (e.g. independence of samples) which would be stated more formally in other approaches.

Efron's explanations
[Figure sequence: Efron's illustrations of the bootstrap]

Bootstrap: how many samples?
Increasing the number of bootstrap samples is like increasing the resolution of an image.
• For standard deviations/errors, most practitioners suggest that the number of bootstrap samples (B) should be around 200.
• For 95% confidence intervals, B should be between 1000 and 2000.
• Further, estimating a confidence interval usually requires estimating the 100α percentile of the bootstrap distribution. To do this, the bootstrap sample is first sorted into ascending order. Then, if α(B+1) is an integer, the percentile is estimated by the α(B+1)th member of the ordered bootstrap sample. Otherwise, interpolation must be used between the [α(B+1)]th and ([α(B+1)]+1)th members of the ordered sample, where [ ] denotes the integer part. Consequently, choosing B=999 or B=1999 leads to simple calculations for the common choices of α.

Bootstrap: which type of resampling?
• Bootstrap with non-parametric resampling: makes no assumptions concerning the distribution of, or model for, the data.
• Bootstrap with parametric resampling: we assume that a parametric model for the data is known up to an unknown parameter vector.
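The parametric variant just described can be sketched as follows; purely for illustration, a normal model is fitted to made-up data, and resamples are simulated from the fitted model rather than drawn from the observed values:

```python
import random
import statistics

random.seed(11)  # arbitrary seed for repeatability

data = [5.2, 4.8, 5.5, 5.0, 4.6, 5.3, 5.1, 4.9]  # made-up observations

# Fit the assumed parametric model (here: normal) to the data
mu_hat = statistics.mean(data)
sigma_hat = statistics.stdev(data)

# Parametric bootstrap: simulate new samples from the FITTED model,
# not from the observed values themselves
B = 1000
boot_medians = []
for _ in range(B):
    simulated = [random.gauss(mu_hat, sigma_hat) for _ in data]
    boot_medians.append(statistics.median(simulated))

se_median = statistics.stdev(boot_medians)  # bootstrap SE of the median
print(round(se_median, 3))
```

The only change from the non-parametric version is the resampling step; everything downstream (standard errors, percentile intervals) is computed the same way.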
• Bootstrap with semi-parametric resampling: a variant of parametric resampling, appropriate for some forms of regression.

Bootstrap: which type of resampling? Then, which one to choose?
• Parametric and non-parametric simulation make very different assumptions. The general principle is that the simulation process should mirror as closely as possible the process that gave rise to the observed data. Thus, if we believe a particular model (i.e. we believe that the fitted model differs from the true model only because the true values of the parameters have been replaced by estimates obtained from the data), then the parametric (or, in regression, preferably semi-parametric) bootstrap is appropriate.
• However, examination of the residuals may cast doubt on the modelling assumptions. In this case, non-parametric simulation is often appropriate. It is interesting to note that, in practice, non-parametric simulation generally mimics the results obtained under the best-fitting, not the simplest, parametric model.

Bootstrap and beyond
Example: bootstrapping a logistic regression analysis. Predictors selected in 25 bootstrap replications for the cohort study (n=155; 33 with events), based on forward stepwise logistic regression. The predictors selected by the actual data were variables 13, 15, 7, 20.
[Table: the sets of predictor variables selected in each of the 25 bootstrap replications]

A pivotal work on bootstrap

Bootstrap for internal validation
Internal validation of predictors generated by multivariable logistic regression analysis was performed by means of bootstrapping techniques, with 1000 cycles and generation of ORs and bias-corrected 95% CIs.14
14. Steyerberg EW, Harrell FE Jr, Borsboom GJJM, Eijkemans MJCR, Vergouwe Y, Habbema JDF.
Internal validation of predictive models: efficiency of some procedures for logistic regression analysis. J Clin Epidemiol 2001;54:774–781.

Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods

Jackknife
• Jackknifing is a resampling method based on the creation of several subsamples by excluding a single case at a time.
• Thus, there are only N jackknife samples for any given original sample with N cases.
• After the systematic recomputation of the statistic of choice is completed, a point estimate and an estimate of the variance of the statistic can be calculated.

Bootstrap vs jackknife: the winner is…
The jackknife is an acceptable approximation of the bootstrap only for linear statistics.

Resampling
• Basic concepts and applications
• Bootstrap
• Jackknife
• Other approaches, including Monte Carlo, Markov chain, and Gibbs sampling methods

Do you like Monte Carlo?

Monte Carlo resampling
• Monte Carlo methods are a class of computational algorithms that rely on repeated random sampling to compute their results. They provide approximate solutions to mathematical problems by sampling experiments.
• Because of their reliance on repeated computation and random or pseudo-random numbers, Monte Carlo methods are most suited to calculation by a computer, being chosen when it is infeasible or impossible to compute an exact result with a deterministic algorithm.
• These simulation methods are especially useful for studying systems with a large number of coupled degrees of freedom, e.g. fluids, social behaviors, hierarchical models, ….

Monte Carlo resampling
• More broadly, Monte Carlo methods are useful for modeling phenomena with significant uncertainty in inputs (e.g. Bayesian statistical analyses).
• The term Monte Carlo was coined in the 1940s by physicists working on nuclear weapon projects at Los Alamos (Ulam, Fermi, von Neumann, and Metropolis).
The name is a reference to the Monte Carlo Casino in Monaco, where Ulam's uncle would borrow money to gamble.
• The use of randomness and the repetitive nature of the process are analogous to the activities conducted at a casino.

Impact of Monte Carlo methods
• Monte Carlo simulation methods do not always require truly random numbers to be useful, although for some applications unpredictability is vital.
• Many of the most useful techniques use deterministic, pseudo-random number sequences, making it easy to test and re-run simulations.
• Monte Carlo resampling simulation takes the mumbo-jumbo out of statistics and enables even beginning students to understand completely everything that is done.
• The application of Monte Carlo methods in teaching statistics is also not new.
• What is new and radical is using Monte Carlo methods routinely as problem-solving tools for everyday problems in probability and statistics.
• Computationally intensive but conceptually simple methods belong at the forefront, whereas traditional analytical simplifications lose importance.
• Monte Carlo simulations are not only relevant for simulating models of interest; they also constitute a valuable tool for approaching statistics.

Markov chain
• In mathematics, a Markov chain, named after Andrey Markov, is a discrete-time stochastic process with the Markov property.
• Having the Markov property means that, for a given process, knowledge of the previous states is irrelevant for predicting the probability of subsequent states.
• In this way a Markov chain is "memoryless": no given state has any causal connection with a previous state.
• At each point in time the system may have changed state from the state it was in the moment before, or it may have stayed in the same state. The changes of state are called transitions. If a sequence of states has the Markov property, then every future state is conditionally independent of every prior state.
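The memoryless transition behaviour just described can be illustrated with a small two-state chain; the states and transition probabilities are invented for the example:

```python
import random

random.seed(3)  # arbitrary seed for repeatability

states = ["sunny", "rainy"]  # hypothetical states
# transition[s] gives P(next state | current state s): only the CURRENT
# state matters for the next draw -- the Markov property
transition = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}

state = "sunny"  # arbitrary starting state
path = [state]
for _ in range(1000):
    weights = [transition[state][s] for s in states]
    state = random.choices(states, weights=weights)[0]  # one transition
    path.append(state)

# The long-run fraction of time spent in each state approximates the
# chain's stationary distribution (here, 2/3 for "sunny")
frac_sunny = path.count("sunny") / len(path)
print(round(frac_sunny, 2))
```

Each step consults only the current state, never the history; the long-run behaviour converging to a stationary distribution is exactly the property MCMC exploits below.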
Markov chain
• The PageRank of a webpage, as used by Google, is defined by a Markov chain.
• Markov chain methods have also become very important for generating sequences of random numbers that accurately reflect very complicated desired probability distributions, via a process called Markov chain Monte Carlo (MCMC).
• In recent years this has revolutionised the practicability of Bayesian inference methods, allowing a wide range of posterior distributions to be simulated and their parameters found numerically.

Markov chain Monte Carlo (MCMC)
• Markov chain Monte Carlo (MCMC) methods are a class of algorithms for sampling from probability distributions, based on constructing a Markov chain that has the desired distribution as its stationary distribution.
• The state of the chain after a large number of steps is then used as a sample from the desired distribution. The quality of the sample improves as a function of the number of steps.
• Usually it is not hard to construct an MCMC sampler with the desired properties. The more difficult problem is to determine how many steps are needed to converge to the stationary distribution within an acceptable error.

Metropolis-Hastings algorithm
• The Metropolis-Hastings algorithm is a method for creating a Markov chain that can be used to generate a sequence of samples from a probability distribution that is difficult to sample from directly.
• This sequence can be used in MCMC simulation to approximate the distribution (i.e. to generate a histogram) or to compute an integral (such as an expected value).
• The Gibbs sampling algorithm is a special case of the Metropolis-Hastings algorithm which is usually faster and easier to use, but is less generally applicable in physics (yet is actually the most common in Bayesian analysis).

Gibbs sampler
• Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables.
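A minimal Gibbs-sampler sketch for a standard bivariate normal with correlation ρ, a textbook special case chosen because its full conditionals are known in closed form (ρ = 0.8 and the chain settings are arbitrary choices for the example):

```python
import random
import statistics

random.seed(5)  # arbitrary seed for repeatability

rho = 0.8                   # assumed correlation of the target distribution
sd = (1 - rho**2) ** 0.5    # SD of each full conditional N(rho*other, sd)

x, y = 0.0, 0.0             # arbitrary starting state of the chain
draws = []
for i in range(5000):
    # Update each coordinate in turn from its full conditional
    # distribution given the current value of the other coordinate
    x = random.gauss(rho * y, sd)
    y = random.gauss(rho * x, sd)
    if i >= 500:            # discard early steps as burn-in
        draws.append((x, y))

# The retained draws approximate the joint distribution: each margin
# should look like a standard normal (mean ~ 0, SD ~ 1)
xs = [d[0] for d in draws]
print(round(statistics.mean(xs), 2), round(statistics.stdev(xs), 2))
```

No draw from the joint distribution is ever needed; alternating draws from the two conditionals is what makes Gibbs sampling convenient when the joint is awkward but the conditionals are simple.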
• The purpose of such a sequence is to approximate the joint distribution, or to compute an integral (such as an expected value).
• Gibbs sampling is a special case of the Metropolis-Hastings algorithm, and thus an example of a Markov chain Monte Carlo (MCMC) algorithm. More specifically, it is a form of MCMC in which the value Xn at each site n is updated in turn using the full conditional distribution of Xn given the values Xm at all other sites m ≠ n. Successive sites n may be chosen systematically or randomly.
• It is also known as the heat bath sampler.

One of my favourite articles

A pivotal work based on Monte Carlo methods
Concato et al, JCE 1995

Software
• BoxSampler (add-in for Excel)
• C
• DDXL
• R
• Fortran
• Lisp
• Pascal
• Resampling Stats (add-in for Excel)
• RiskAMP
• S
• S-Plus
• SAS
• SimulAr
• StatXact

Questions?

Take home messages
• Resampling methods are being increasingly used, given the major breakthroughs in computational power.
• Among cross-validation, jackknife, and bootstrap, the latter is the most powerful and robust tool for statistical inference and validation.
• Monte Carlo simulation methods have become more and more common for estimating the bias of different statistical procedures and for fitting complex Bayesian models.
• In all but very exceptional cases, resampling methods are best left in the hands of a statistical professional, working together with the clinician to achieve the clinician's goal while safeguarding validity.

And now a brief break…
For further slides on these topics please feel free to visit the metcardio.org website: http://www.metcardio.org/slides.html