Best Practices in Statistical Data Analysis
Valedictory Symposium for Willem J. Heiser
Leiden, January 30, 2014
Program
13.00 David Hand: What’s the problem: Answering the statistical question
13.30 Henk Kelderman: Improving latent variable models by using collateral variables
14.30 Serge Rombouts: Best practices in Functional Magnetic Resonance Imaging of the Brain
15.00 Richard Gill: Worst practices in statistical data analysis
16.00 Leland Wilkinson: Anomalies
16.30 Lawrence Hubert: Henry A. Wallace (1888-1965): Agricultural Statistician (Econometrician) Extraordinaire
Abstracts
What’s the problem: Answering the statistical question
David Hand
George Box once pointed out that statistical analysis involved two approximations: a big one, which
is the approximation to the problem you want to solve, and a small one, leading to finding the
solution to the approximate problem. I give four examples showing that we sometimes devote too
little attention to the big approximation, so that we end up finding precise answers to questions
which are irrelevant to our objectives.
Improving latent variable models by using collateral variables
Henk Kelderman
There are several reasons why latent variable models are not always appropriate for analyzing test
and questionnaire data. Since by definition not much is known a priori about latent variables, one
needs either strong assumptions or a lot of data to estimate latent variable models. For example,
for quality-of-life or clinical questionnaires the latent variables may not be normally distributed,
and one usually needs a large sample to estimate a latent variable model with less restrictive
distributional assumptions. Another evil afflicts the assumption of independence of measurement
errors. Administering a test can be seen as a psychological experiment with sequential item trials.
When subjects answer a questionnaire item, they read and try to understand the question, retrieve
information from memory, make a judgment, and give a response. It would be hard to convince a
seasoned experimentalist that these processes do not influence those of the next item. Both problems can be tackled by adding many additional variables from the nomological network around
the latent variable of interest to the model. In this paper we show how statistical learners could be
employed to improve structural equation models with latent variables.
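
A minimal, purely illustrative sketch of the general idea on simulated data (the one-factor model, the learner, and all variable names below are assumptions, not the models discussed in the talk): fit a measurement model to questionnaire items, then let a flexible statistical learner gauge how much of the estimated factor scores a set of collateral variables recovers.

    # Purely illustrative: simulate items driven by one latent trait plus two
    # collateral variables (all names and numbers are assumptions, not real data).
    import numpy as np
    from sklearn.decomposition import FactorAnalysis
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)
    n = 500
    theta = rng.normal(size=n)                                   # simulated latent trait
    items = theta[:, None] + rng.normal(size=(n, 10))            # 10 questionnaire items
    collateral = np.c_[theta + rng.normal(scale=0.5, size=n),    # collateral variables from
                       rng.normal(size=n)]                       # the nomological network

    # One-factor measurement model; factor scores approximate the latent variable.
    fa = FactorAnalysis(n_components=1).fit(items)
    scores = fa.transform(items).ravel()

    # A flexible learner checks how much of the factor score the collateral
    # variables recover, without normality or linearity assumptions.
    learner = RandomForestRegressor(n_estimators=200, random_state=0)
    r2 = cross_val_score(learner, collateral, scores, cv=5, scoring="r2")
    print("cross-validated R^2 of collateral variables:", round(r2.mean(), 2))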
Best practices in Functional Magnetic Resonance Imaging of the Brain
Serge Rombouts
Functional Magnetic Resonance Imaging (FMRI) of the Brain is a technique to image brain activation. FMRI data in one individual consists of ~100,000 brain ‘voxels’ (voxels are 3D pixels), each
with a few hundred time points. Each voxel’s time course represents dynamic FMRI brain activity
of a specific region in the brain. Usually one or more groups of subjects are studied. FMRI applications include studying brain function in normal controls, in psychiatric and neurologic patients, in
brain development and aging, after pharmacologic manipulations, for pre-surgical planning, and for
associating genetic information with regional brain function.
Two sorts of FMRI studies can be distinguished: ‘task-FMRI’ and ‘resting state FMRI connectivity’. In task-FMRI, brain activation is manipulated using a task. Analyses are aimed at finding
task-related brain activation. In studies of resting state FMRI connectivity, spontaneous changes in
brain activity are studied, without the application of an externally controlled task. Here, functional
connectivity of spontaneous FMRI activity in different brain regions is studied using correlation
and regression techniques.
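
A minimal sketch of the seed-based correlation idea, on simulated numbers (all names below are placeholders, not the actual analysis):

    # Toy seed-based connectivity: correlate a seed region's spontaneous signal with
    # every voxel's time course (simulated data).
    import numpy as np

    rng = np.random.default_rng(5)
    voxels = rng.normal(size=(200, 10_000))      # time points x voxels
    seed = voxels[:, :50].mean(axis=1)           # mean signal of a seed region of interest

    # Pearson correlation of the seed with each voxel, via standardized time courses.
    z_seed = (seed - seed.mean()) / seed.std()
    z_vox = (voxels - voxels.mean(axis=0)) / voxels.std(axis=0)
    connectivity_map = z_vox.T @ z_seed / len(seed)      # one correlation per voxel
    print(connectivity_map.shape)                        # (10000,)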
In each individual, preprocessing of FMRI data includes temporal and spatial filtering, and
motion correction. Next, in each individual a general linear model is applied for statistical analysis. For task-FMRI, the statistical model has at least one regressor representing the expected task-related behavior. For resting state FMRI connectivity, the regressors are the spontaneous FMRI
signal in one or more regions of interest. Analysis results in 3D images containing betas representing the association of the voxels’ time courses with the temporal regressors of the model.
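
A rough sketch of the voxelwise beta estimation on simulated data (no HRF convolution, filtering, or image I/O; an illustration only, not the actual pipeline):

    # Toy voxelwise GLM: intercept + task regressor, ordinary least squares per voxel.
    import numpy as np

    rng = np.random.default_rng(1)
    n_timepoints, n_voxels = 200, 10_000                  # real data: ~100,000 voxels
    task = np.tile([0.0] * 10 + [1.0] * 10, 10)           # simple on/off block design
    X = np.column_stack([np.ones(n_timepoints), task])    # design matrix
    Y = rng.normal(size=(n_timepoints, n_voxels))         # stand-in for voxel time courses

    # For resting-state connectivity, the task column would be replaced by the
    # spontaneous signal of a seed region of interest.
    betas, *_ = np.linalg.lstsq(X, Y, rcond=None)         # one fit per voxel, vectorized
    task_beta_map = betas[1]                              # beta per voxel; 3D image in practice
    print(task_beta_map.shape)                            # (10000,)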
For group analysis, individual data are registered to a standard brain and group statistics are applied to the individual 3D images that result from the individual data analysis. Statistical thresholding at the group level requires correction for multiple testing in the different brain regions.
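
One common voxelwise option for this correction is Benjamini-Hochberg false discovery rate control; a minimal sketch on simulated p-values (FMRI software also provides random-field and cluster-based corrections):

    # Benjamini-Hochberg FDR written out in numpy, applied to one p-value per voxel.
    import numpy as np

    def fdr_bh(pvals, q=0.05):
        """Boolean mask of p-values significant at false discovery rate q."""
        p = np.asarray(pvals)
        order = np.argsort(p)
        m = len(p)
        passed = p[order] <= q * np.arange(1, m + 1) / m
        k = passed.nonzero()[0].max() + 1 if passed.any() else 0
        mask = np.zeros(m, dtype=bool)
        mask[order[:k]] = True
        return mask

    rng = np.random.default_rng(2)
    pvals = rng.uniform(size=10_000)                      # stand-in: one p-value per voxel
    print(fdr_bh(pvals).sum(), "voxels survive correction")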
I will review the various analysis steps in FMRI research and discuss best practices for a number
of challenges that one encounters in these analyses.
Worst practices in statistical data analysis
Richard Gill
After a long and bitter conflict between the authors, the paper Geraerts, McNally, Jelicic, Merckelbach & Raymaekers (2008) "Linking thought suppression and recovered memories of childhood
sexual abuse" was finally retracted from the journal "Memory". Quite extraordinary errors seem
to have been made in the preparation of the data for statistical analysis. How could this happen,
and how could those errors have been prevented? And why did it take so long to put this right?
As well as statistical issues, I will discuss the role of the media and the role of university administrators in the affair. I argue that when anomalies are discovered in scientific work, the right
way to proceed is to discuss the anomalies openly in the scientific community. Possibly the worst
way to proceed, and I will try to explain why, is to bring the conflict into the realm of judicial or
disciplinary investigations by committees on scientific integrity. Let’s focus in the first place on the
integrity of the results, not on the integrity of the persons.
Anomalies
Leland Wilkinson
Statisticians usually consider anomalies to be identifiable through a process of looking for outliers.
Anomalies, however, are more general than outliers. Even for the restrictive case of points in a
real coordinate space, outliers can be considered anomalies but anomalies are not necessarily outliers. In this talk, I will explore this distinction and present strategies for recognizing anomalies in
pointwise data. Some of these strategies are motivated by the kinds of methods favored by Willem
Heiser and others at Leiden.
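
A generic toy illustration of the distinction (not the specific strategies of the talk): a point sitting between two tight clusters lies closest to the grand mean, so a simple distance-from-centroid check never flags it, while a local-density method such as the local outlier factor does.

    # A point that is anomalous without being a distance outlier (simulated data).
    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(3)
    cluster_a = rng.normal(loc=(-5.0, 0.0), scale=0.3, size=(100, 2))
    cluster_b = rng.normal(loc=(5.0, 0.0), scale=0.3, size=(100, 2))
    lonely = np.array([[0.0, 0.0]])                  # anomalous, but not extreme
    X = np.vstack([cluster_a, cluster_b, lonely])

    # Classical check: distance from the centroid never flags the lonely point.
    dist = np.linalg.norm(X - X.mean(axis=0), axis=1)
    print("points farther from the centroid:", int((dist > dist[-1]).sum()))

    # Local-density view: the same point is labeled anomalous (-1).
    labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X)
    print("LOF label of the lonely point:", labels[-1])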
Henry A. Wallace (1888-1965): Agricultural Statistician (Econometrician) Extraordinaire
Lawrence Hubert
This talk is about Henry A. Wallace who was (among other positions he held) the U.S. Secretary of
Agriculture under Roosevelt and the New Deal (1933-1940), and Vice-President under Roosevelt
(1941-1945). Wallace demonstrated the absolute best practices in (agricultural) statistics and econometrics, including during his time as the New Deal Secretary of the USDA. He demanded an evidence-based set of agricultural policies and practices (through statistics and data gathering) that helped
pull the U.S. out of the Great Depression. Wallace is arguably the single person responsible for
the first Department of Statistics in the U.S.; this is because of Wallace’s connections with George
Snedecor. Finally, Wallace was the lead architect in the development of numerical procedures (in
the 1920s) for solving the normal equations in large-scale multiple regression.
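
For reference, the normal equations set X'X b = X'y for the least-squares coefficients b; solving them for large regressions in the 1920s, long before electronic computers, was a major undertaking. A minimal sketch in modern terms, on simulated data:

    # The computation the abstract refers to: solve the normal equations X'X b = X'y.
    import numpy as np

    rng = np.random.default_rng(4)
    X = np.c_[np.ones(50), rng.normal(size=(50, 3))]     # intercept + three predictors
    y = X @ np.array([1.0, 2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=50)

    b = np.linalg.solve(X.T @ X, X.T @ y)                # least-squares coefficients
    print(b.round(2))                                     # approximately [1.0, 2.0, -1.0, 0.5]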