Download JK Lindsey Source: Journal of the Royal Statistical Society. Series D

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Statistics wikipedia, lookup

History of statistics wikipedia, lookup

Foundations of statistics wikipedia, lookup

Transcript
Some Statistical Heresies
Author(s): J. K. Lindsey
Source: Journal of the Royal Statistical Society. Series D (The Statistician), Vol. 48, No. 1
(1999), pp. 1-40
Published by: Blackwell Publishing for the Royal Statistical Society
Stable URL: http://www.jstor.org/stable/2680893 .
Accessed: 27/02/2011 12:19
Your use of the JSTOR archive indicates your acceptance of JSTOR's Terms and Conditions of Use, available at .
http://www.jstor.org/page/info/about/policies/terms.jsp. JSTOR's Terms and Conditions of Use provides, in part, that unless
you have obtained prior permission, you may not download an entire issue of a journal or multiple copies of articles, and you
may use content in the JSTOR archive only for your personal, non-commercial use.
Please contact the publisher regarding any further use of this work. Publisher contact information may be obtained at .
http://www.jstor.org/action/showPublisher?publisherCode=black. .
Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed
page of such transmission.
JSTOR is a not-for-profit service that helps scholars, researchers, and students discover, use, and build upon a wide range of
content in a trusted digital archive. We use information technology and tools to increase productivity and facilitate new forms
of scholarship. For more information about JSTOR, please contact support@jstor.org.
Blackwell Publishing and Royal Statistical Society are collaborating with JSTOR to digitize, preserve and
extend access to Journal of the Royal Statistical Society. Series D (The Statistician).
http://www.jstor.org
The Statistician(1999)
48, Part1, pp. 1-40
Some statisticalheresies
J. K. Lindsey
LimburgsUniversitair
Centrum,Diepenbeek, Belgium
[Read before The Royal Statistical Society on Wednesday, July 15th, 1998, the President,
ProfessorR. N. Curnow,in the Chair]
Summary. Shortcomingsof modernviews of statisticalinferencehave had negativeeffectson the
image of statistics,whetherthroughstudents,clientsor the press. Here, I question the underlying
foundationsof moderninference,includingthe existence of 'true'models, the need forprobability,
whetherfrequentist
or Bayesian, to make inferencestatements,the assumed continuity
ofobserved
data, the ideal oflargesamples and the need forproceduresto be insensitiveto assumptions.Inthe
contextofexploratory
inferences,I considerhow muchcan be done by usingminimalassumptions
relatedto interpreting
a likelihoodfunction.Questions addressed includethe appropriateprobabilisticbasis of models, ways of calibratinglikelihoodsinvolving
differing
numbersof parameters,the
roles of model selectionand model checking,the precisionof parameterestimates,the use of prior
and the relationshipof these to sample size. I compare thisdirectlikelihood
empiricalinformation
approach withclassical Bayesian and frequentistmethods in analysingthe evolutionof cases of
acquired immunedeficiencysyndromeinthe presence of reporting
delays.
Keywords:Acquiredimmunedeficiencysyndrome;Akaike'sinformation
criterion;
Asymptotics;
Compatibility;
Consistency;Discretedata; Hypothesistest;Likelihood;Likelihoodprinciple;Model
Poisson distribution;
selection;Nonparametric
models; Normaldistribution;
Robustness; Sample
size; Standarderror
1. Introduction
Statisticians
are greatlyconcernedaboutthe low publicesteemforstatistics.
The disciplineis
oftenviewedas difficult
andunnecessary,
orat bestas a necessaryevil.Statisticians
tendto blame
thepublicforitsignoranceof statistical
butI believethatalmostall of thefaultlies
procedures,
withstatisticians
andthewaythattheyuse,and teach,statistics.
The twogroupswith
themselves
whom statisticians
the
have the most directcontactare studentsand clients.Unfortunately,
and contradictions.
Much of thisarisesfromthe
messagethatthesereceiveis fullof conflicts
inordinate
emphasisthatstatisticians
place on a certainnarrowviewof theproblemof inference
withtheteachingofthefirstintroductory
statistics
course.Knowinga vastarray
itself,beginning
than
of tests,or,morerecently,
estimation
is consideredto be muchmoreimportant
procedures,
whichwillbe suitableforeach particular
beingawareof whatmodelsare available,and studying
dataanalysis.
questionat handandfortheaccompanying
In whatfollows,I considersomeoftheshortcomings
ofmodemviewsof statistical
inference,
This is necessarybecausetheseviewsare increasingly
Bayesianand frequentist.
beingignoredin
statistics
appliedstatistical
practice,as thegap betweenit andtheoretical
widens,and are having
is difficult
to find
negativeeffectson studentsand clients.An adequatemeansof presentation
because of the vast and internally
natureof both approaches:to any critical
contradictory
J.K. Lindsey,Department
of Biostatistics,
Address
3590 Diepenforcorrespondence:
LimburgsUniversitair
Centrum,
beek,Belgium.
E-mail:jlindsey@luc.ac.be
? 1999 RoyalStatisticalSociety
0039-0526/99/48001
2
J.K Lindsey
but
statement,
some Bayesianor frequentist
is certainto say 'I am a Bayesianor frequentist
of
wouldneverdo that'.For example,withthegrowingrealizationof thepracticaldifficulties
haveturnedto evaluating
implementing
a purelysubjectiveBayesianparadigm,
somestatisticians
criteria;I see themas
techniquesderivedthroughthe Bayes formulaby long run frequentist
probabilities
forobseressentiallyfrequentists.
Or considerhowmanyBayesiansuse frequency
evaluatesinference
methodsaccordingto theirlong
vables.Forme,thequintessential
frequentist
in repeatedsampling,
runproperties
whereastheequivalentBayesiandependson methodsbeing
thedata.
coherent
afterapplying
theBayesformula
topersonalpriorselicitedbeforeobserving
has evolvedin manyand unforeseeable
ways in the past few decades.
Statisticalinference
supremacy,
in spiteof
Previously,
'classical' frequentist
proceduresseemedto hold uncontested
themajorcontributions
of R. A. Fisher(and Neyman'sdenialthatinference
was possible).Much
of statistical
oftheimpetusforthischangehas comefromthecomputerization
analysisand from
of modernstatistical
inference
is thecentralrole
theBayesiancritique.One ofthemajorfeatures
whichwas introduced
of the likelihoodfunction,
of
long beforeby Fisher.Withthe rewriting
statistical
fewmodernstatisticians
realizethatFisherwas nevera frequentist
forinference
history,
of probability),
beingmuchcloserto the
(althoughhe generallyused a frequency
interpretation
thanto theNeyman-Pearson
Bayesians,suchas Jeffreys,
school.
In thefollowing
text,I look at someof thesedominant
ideas,old and new,especiallyas they
in
arecurrently
letme clarify
theareato be discussed.I
practised appliedstatistics.
However,
first,
with
in
the
Fisherian
of
am concerned
statistical
sense obtaining
themaximuminforinference,
mationfromthegivendataaboutthequestionat hand,without
incorporating
priorknowledgeand
availableas
information,
otherthanthatrequiredin theconstruction
ofthemodel(or empirically
theargument
to follow),and
likelihoodfunctions
frompreviousstudies,butthatis anticipating
be used (althoughthis
withoutconcernfortheway in whichtheconclusionswill subsequently
in howthequestionis posed). I thusexcludedecisionproceduresthatrequire
maybe reflected
modelofhumanbehaviour.
somesupplementary
I shalluse thetermknowledgeto referto availablescientific
theories,expressibleas models,
thosetheories.
and information
to referto empirically
observabledata,alwaysfiltered
through
from(prior)personalbeliefs.
Thus,bothofthesemustbe distinguished
i.e. to exploratory
rather
than
I restrict
attention
to modelselectionandestimation
ofprecision,
In current
at leastin myexperience,
morethan95% of
inference.
scientific
confirmatory
research,
workof mostappliedstatisticians
is exploratory,
as opposedto testingone precise
theinferential
modelconceivedbeforethepresentcollectionof data.Scientists
usuallycometo thestatistician
to be filledoutby empiricaldata collection,notwitha specific
witha vaguemodelformulation
to test.Powerful
and easilyaccessiblestatistical
packageshavemade this
hypothesis
computers
When
to be necessary.
appearto be a routinetaskwherethestatistician
maynotevenbe thought
to test,theyoftenno longerneed
reachthestageof havinga clear simplehypothesis
scientists
the factsmayalmostseem to speak forthemselves.(Phase III trialsin
statistics:
sophisticated
are required,
at leastformally.)
researchare a majorexceptionwherestatisticians
pharmaceutical
facts.
Theoriesofstatistical
inference
havenotadjustedto theseelementary
inference
As I see it,two basic principlesshouldunderliethe choice amongstatistical
procedures.
to draw
selectedaxioms can be used to provea theorem;in contrast,
(a) In mathematics,
in the availableempiricaldata abouta set of
statisticalinferences,
all the information
models,i.e. aboutthe questionat hand,mustbe used if the conclusionsare not to be
arbitrary.
ofreality,
withepistemological,
butno ontological,
crudesimplifications
(b) Modelsarerather
Statistical
Heresies
3
status.If observablesare important,
inferences
mustbe invariant
to theparameterizations
inwhichsuchmodelsarespecified,
theirapplication
inprediction
beinga typicalexample.
Manystatisticians
are willingto hedgethe firstprincipleby usingonlythe relevantempirical
information,
as maybe thecase whenan 'appropriate'
teststatistic
is chosen.This could leave
inference
opento thearbitrary:
withouta cleardefinition,
therelevantcan be chosento provea
point.(Models can also be so chosen,buttheyare thereforcriticsto scrutinize.)
It is generally
at least
agreedthatthisprinciplemeansthatinference
mustbe based on thelikelihoodfunction,
whencomparingmodels and estimating
the precisionof parameters.
(The problemof model
forusingall theinformation
arenotgenerally
available.)Thus,
checkingis morecomplex;criteria
in the data; the likelihoodfunction
modelsneed notbe constructed
to use all the information
determines
whatis relevant
to theproblem.In turn,thisfunction
Both
obeysthesecondprinciple.
ones: for
principlestogetherexclude manyfrequentist
procedures,especiallythe asymptotic
intervals
based on standarderrorsdo notuse efficiently
example,forsmallsamples,confidence
in the data and are notparameterization
invariant.
The secondexcludesmany
the information
practicalBayesianprocedures:
forcontinuous
parameters,
forexample,regionsof highestposteriordensityare notpreserved
undernon-linear
transformations
whereasthosebased on tail areas
are.
Havingsaid this,we mustthenask whatis left.The commoncriticisms
ofboththefrequentist
and the Bayesianschoolsare sufficiently
well knownand do not need to be restated.I rather
considerjust how much can be done using minimalassumptionsrelatedto interpreting
a
likelihoodfunction.
Manyof theideas presentedherearosewhileI was doingtheresearchfor
Lindsey(1996a), althoughmostare notnecessarily
explicitly
statedin thattext.The pathbreaking,butlargelyignored,
paperofBarnardetal. (1962) is fundamental
reading.
1.1. Example
I shalluse a simpleexampleto illustrate
someofthepointsdiscussedlater,althoughobviouslyit
will notbe appropriate
forall of them.In suchlimitedspace, I can onlyprovidea parodyof a
properanalysisofthescientific
questionunderconsideration.
The estimationof the incidenceof acquiredimmunedeficiencysyndrome(AIDS) is an
modelthatwill describe
taskin modernsocieties.We wouldliketo obtaina stochastic
important
untilthepresentandthat
howtheepidemichas evolvedgloballywithina givencountry
succinctly
in relationto healthservicerequirements.
Data are obtainedfrom
maybe usefulforprediction
not
doctorswhomustreportall cases to a centralofficein theircountry.
Thus,we studydetection,
incidence.However,reporting
maytaketimeso we have a secondstochasticprocessdescribing
thearrivalofthedoctors'reports.
One difficult
problemarisesbecauseofthesedelays.Some reports
maytakeseveralmonthsto
over recentperiodsare incomplete.Thus, the
arriveso, at any giventime,the observations
table withthe periodsof
can be represented
as a two-waycontingency
availableinformation
in thebottomright-hand
corneris
interest
as rowsandthereporting
delaysas columns.A triangle
missing.Forexamplesof suchtables,reported
by quarteryear,see De Angelisand Gilks(1994)
forEnglandandWalesandHayandWolak(1994) fortheUSA. In sucha table,therowtotalsgive
butthemostrecentvaluesaretoo smallbecauseof
thenumberofdetectedAIDS cases perquarter
themissingtriangle.For thesetwo data sets,theobservedmarginaltotalsare plottedin Fig. 1
curvesfromsomemodelsthatI shalldiscussbelow.We see howthereported
alongwiththefitted
inthemostrecentquarters
becauseofthemissingdata.
numbers
dropdramatically
4
J. K Lindsey
0
to
1984
1986
1988
Year
(a)
l1990
1992
1 982
1984
1986
1988
l1990
Year
(b)
Fig. 1. Observed totals (e) and several models to estimateAIDS cases in (a) England and Wales and (b) the
USA:
,stationary nonparametric;------- non-stationary
paranonparametric;.......
non-stationary
metric
2.
True models
Most statisticiansadmitthattheirmodels are in no sense 'true', although some argue thatthereis
a true underlyingmodel; many like to cite a well-knownphrase of Box (1976, 1979) that all
models are wrong but some are useful (which models depend on the question at hand, mightI
add). Nevertheless,in practice,statisticians,when makinginferences,do not act as if thiswere so:
almost all modern statisticalprocedures,Bayesian and frequentist,are implicitlybased on the
assumption that some one of the models under consideration is correct. This is a necessary
conditionfor decision-makingbut antitheticalto scientificinference,a firstfundamentaldistinctionbetweenthetwo.
The assumption that a model is true is sometimes clearly and openly counter-intuitive;
for
testswhethera null hypothesisis correct,generallyhoping thatit is not, so
example, a frequentist
thatthe alternativehypothesiscan be retained.In contrast,the Fisherianinterpretation
of a test is
throughP-values providing weight of evidence against a hypothesis; they do not require an
alternativehypothesis.(Alternativehypotheses,and the accompanyingpower calculations, were
introducedby frequentiststo choose among Fisheriantests.) Asymptotictheoryis more subtle; it
findsits rationale in an estimatedmodel convergingto the truemodel as the numberof observations increases: hence, forexample, the importanceof asymptoticconsistency,discussed below, in
frequentist
procedures.
In a similar way, the Bayesian prior gives the probabilitiesthat each of the various models
under considerationis the truemodel. The likelihood principle,also discussed below, carries the
assumptionthatthe model functionis true and thatonly the correctparametervalues need to be
determined.
Many will argue thatthese are 'as if' situations:conditionalstatements,not to be taken literally.
But all statisticalthoughtis permeatedby this attitudewhich, if we take modelling seriously,has
far-reaching
negativeimplications,fromteachingto applications.
StatisticalHeresies
5
Thus,muchof statistical
inference
revolvesaroundtheidea thatthereis some 'correct'model
is to findit. But,theverypossibility
and thatthejob of the statistician
of makinga statistical
All differences
inference,
and indeedall scientific
work,dependson marginalization.
among
observational
to be pertinent
unitsthatare notthought
forthequestionat handare ignored.In
to be possible,theobserveduniqueindividuals
otherwords,forgeneralization
mustbe assumedto
be 'representative'
withinthosesubgroups
thatarebeingdistinguished.
and exchangeable
Thisis
alwaysan arbitrary,
butnecessary,
assumption,
whichis nevertrue.The only'model'thatone (not
I) mightcall truewould be thatdescribingthe historyof everysubatomicparticlesince the
of theuniverse,
and thathistory
is unobservable
beginning
becauseobservation
wouldmodifyit.
All othermodelsinvolvemarginalization
thatassumesthatseemingly
irrelevant
differences
can
be ignored.(See LindseyandLambert(1998) forfurther
discussion.)
We shouldrather
thinkin termsof modelsbeingappropriate
whichare useful
simplifications,
fordetectingand understanding
generalizablepatternsin data,and forprediction;
the typeof
to be detectedandunderstood,
thepredictions
patterns
to be madeandthemodelsto be used will
dependon thequestionsto be answered.The complexity
of themodelwill determine
howmuch
dataneedto be collected.By definition,
no model,in scienceor in statistics,
is evercorrect;nor
does an underlying
of reality,
truemodelexist.A modelis a simplification
constructed
to aid in
itis an epistemological
tool.
understanding
it;as statedinmysecondprinciple,
2. 1. Example
In studying
in theoveralltrend.An appropriate
theevolutionof AIDS, we are interested
model
can assumea smoothcurveovertime,ignoring
themore'correct'modelthatallowsforvariations
in detection
An even 'truer'
overthedaysof theweek,withfewercases observedon week-ends.
modelwould have to accountforthe doctors'officehoursduringeach day.(Of course,such
information
is notavailablefromthedataas presented
above.)
In a 'true'model,varioussubgroups,
suchas homosexualsor drugusers,shouldalso be taken
intoaccount.Butwe muststopsomewhere,
beforeintroducing
sufficient
characteristics
to identify
individualsuniquely,or we shallbe unableto makeanygeneralizations,
i.e. anyinferences.
For
thequestionat hand,theoverallevolutionof AIDS in a country,
suchsubgroups
be
may ignored
although
modellingthemandthenmarginalizing
mayproducemorepreciseresults.
3.
Probability statements
are convincedthatinferencestatements
mustbe made in termsof
Virtuallyall statisticians
It providesa metricbywhichsuchstatements,
arethought
madein different
probability.
contexts,
to be comparable.However,fora probability
and anymodelbased on it,to exist,the
distribution,
notjust thoseeventsobservedor directly
under
space of all possibleeventsmustbe defined,
In thissense,probability
consideration.
is fundamentally
we mustknowexactlywhatis
deductive:
This is finefordecisionprocedures,
but not forinference,
whichis
possiblebeforestarting.
insiston usingprobability
forinference?
inductive.
Whythendo statisticians
arenotagreedabouteitherthedefinition
of inference
ortowhatsuch
Statisticians
probabilities
statements
shouldrefer.
Twowidelyacceptedjustifications
canbe foundintheliterature:
inthelongrunand
oferrorsin inference
statements
can be controlled
(a) theprobability
ofunknowns
new
can be elicited,beforeobtaining
(b) ifcoherent
personalpriorprobabilities
and updatedby the Bayes formulathe posteriorwill also have a coherent
information,
probabilistic
interpretation.
6
J.K. Lindsey
The Fisherianenumeration
ofall possibilities
undera hypothesis,
yieldinga P-valueindicating
whetheror nottheobservations
are rare,could conceivablybe used witheithera frequentist
or
Anotherapproach,Fisher'sfiducialprobability,
is notwidely
personalprobability
interpretation.
accepted.
The case againstfrequentist
is well known.Amongotherthings,they
probability
statements
referto and dependon unobserved
outcomesthatare notrelevantto theinformation
and questionat hand.(This can be reasonableformodelcriticism,
as withtheFisheriantest.)The related
problemswithBayesianprobability
statements
havebeenlittlediscussed,exceptfortheobjection
thattheyarenot'objective',no empiricalcheckbeingavailable.
The FisherianP-valueis perhapsthemostwidelyused inference
in the
probability.
Judging,
theobserveddataarerareundera probability-based
senseofa Fisheriansignificance
test,whether
statistical
modelrelieson theexistenceof a space of observableevents,thesamplespace. The
weak:forgivendata,it is notuniquebecauserarenesscan be definedin many
judgmentis rather
no
ways.Such testsgive an indicationof absolutelack of fitof one completemodel,providing
measureforcomparison
amongmodels(thereis no alternative
hypothesis).
The moreobservations
thereare,themorechancethereis of detecting
lack of fitof thegivenmodel,a pointto whichI
return
later.In contrast,
thelikelihoodfunction
onlyallowsa comparison
amongmodels,providingmeansofcheckingonlyrelative,
andnotabsolute,adequacy.
are not based on the likelihoodfunction(althoughsome
Frequentist
probability
statements
involvethelikelihoodratiostatistic).
Theyuse goodness-of-fit
criteriain an attempt
to compare
are not comparableoutsidethe contextin whichtheyarise,except
models; such statements
thatarenotpertinent
through
longrunarguments
to thedataat hand.Thisis one basis ofthecase
statements
basedon frequentist
againstinference
probability.
The factthata frequentist
testhas beenchosento havepoweragainstsomealternative
is notan
testin themodelselectioncontext(wheretheFisheriantestwouldnot
advantageovera Fisherian
be used anyway).It is misleadingbecause the testwill also have poweragainstmanyother
alternatives.
chosengenerallyleads off
Developinga model in the directionof the alternative
finalmodel.For example,do autocorrelated
residualsin a time
along a pathto an inappropriate
ornon-linearity?
seriesmodelindicatethewrongMarkovian
order,missingortoomanycovariates
Indeed,a hypothesis
mayeven be rejected,notbecause it is wrong,butbecause the stochastic
ofthemodelinwhichitis embeddedis inappropriate.
component
Otherarguments
inferenceincludethe assumptionthathypothesesare
againstfrequentist
formulated
beforelookingat the data,whichis impossiblein an exploratory
context,and the
problemof sequentialtesting,bothdiscussedbelow.The probabilities
arisingfromfrequentist
inference
statements
do notadequatelymeasureuncertainty.
in favourof subjectiveBayesian
Let us now turnto the Bayesianapproach.The argument
relies cruciallyon the principleof coherence.Unfortunately,
thishas onlybeen
probabilities
shownto applyto finitely
additivesetsof models;theextension,
evento countably
additivesets,
it is simplywidelyassumed'formathematical
has not been demonstrated;
convenience'(Hill
contentof most
(1988) and Bernardoand Smith(1994), pages 44-45). Thus,theprobabilistic
is no morethanan unproven
statements
Bayesianinference
assumption.
Besides coherence,theothercriteriaforthevalidityof Bayesianprobabilistic
statements
are
thatthepriorbe personaland thatit be specifiedbeforeobtainingnew information.
Bayesians
butwhoproposeto use a flatprior,to try
who do nothavetheirownpersonalpriordistribution,
variouspriorsto checkrobustness,
to use some mathematically
diffuseprioror to
convenient
thedata,do notfulfilthesecriteria.
In no case do theirposterior
choosea 'prior'afterinspecting
to have
havea personalprobability
'probabilities'
interpretation,
althoughsomemaybe calibrated
long run frequentist
properties.Indeed,thesepeople do not have the courageto admitthat
StatisticalHeresies
7
inferences
canbe drawndirectly
fromthelikelihoodfunction.
Thisis whattheyare,in fact,doing,
at least in thefirsttwo cases. For thosewho are capable of formulating
or elicitinga personal
prior,itsroleshouldbe, notto be combinedwith,butto be comparedwiththelikelihoodfunction
to checktheiragreement,
say in thediscussionsectionof a paper;'sceptical'and 'enthusiastic'
priorscan also be usefulin thiscontext(see, forexample,Spiegelhalter
et al. (1994)). Anyone
whohas studiedcloselythedebateabouttheselectionofpriordistributions,
whichhas been ably
summarized
byKass andWasserman
(1996), mustsurelyhavesomedoubtsaboutusingBayesian
forinference.
techniques
For a priorprobability
distribution
to exist,thespace,hereof all possiblemodelsto be condefined.Anymodelsthatare notincluded,or thathave zero prior
sidered,mustbe completely
probability,
willalwayshavezeroposterior
probability
undertheapplication
oftheBayesformula.
No empiricalinformation
can giveunforeseen
modelspositiveposterior
probability.
This use of
probability
directly
contradicts
theveryfoundation
of scientific
inference,
fromthespecificto the
general,wherethegeneralcan neverbe completely
knownor foreseen.UnderformalBayesian
scientific
is impossible.
procedures,
discovery
In contrast,
in decision-making,
such a completeenumeration
of all possibilitiesis a prerequisite.This is a second fundamental
distinction
betweenscientific
inferenceand decisionmaking.
At leasttheoretically,
Bayesianscan easilyplace personalpriorson parameters
withinmodels.
However,as mentioned
earlier,manyBayesianconclusions,e.g. aboutthecredibility
regionin
whicha set of parameters
mostprobablylies, are not parameterization
invariant
(unlessthey
whentheychangetheparameterization;
see Wasserman
changetheirpriorappropriately
(1989)).
As well,putting
of thespace of
priorson structural
subspaces(i.e. on different
modelfunctions)
all possiblemodelsis extremely
difficult.
For example,how can modelsbased on logisticand
be directly
probitregression
compared?Varioussolutionshavebeenproposed,thetwomostwell
thatplace pointprobability
knownbeingBayesfactors
masseson thevarioussubspaces(see Kass
in a widerclass,although
thelatterwouldhaveto
andRaftery
(1995) fora review)andembedding
be available,withtheappropriate
priors,beforeobservingthe data,fortheposteriorto have a
No solutionis widelyaccepted.
probabilistic
interpretation.
in
We mustconcludethat'probabilities'
statements
arisingfromalmostall Bayesianinference
realisticappliedcontextsdo not have a rigorousprobabilistic
interpretation;
theycannotadequatelymeasureuncertainty.
basedon thesamplespace,are critically
on thechoiceofa
Frequentist
conclusions,
dependent
model'sstochasticstructure.
The procedures
are alwaysimplicitly
testingthischoice,as well as
theexplicithypothesis
aboutparameters:
a modelmaybe rejectedbecausesomeirrelevant
partof
thatare
it is misspecified.
mustadopta modelthatonlymakesassumptions
Thus,a frequentist
of nonparametric
reallynecessaryto thequestionsto be answered(explainingtheattractiveness
as discussedfurther
methods,
later).In contrast,
Bayesianconclusionsaboutparameters
depend
model
thattheoverallstochastic
onlyon theobserveddataandtheprior,butalwaysconditionally
thatforwhichpriorswere
is true.Here,theglobal set of modelfunctions
underconsideration,
constructed
beforeobservingthedata,cannotbe placed in question,whereasin the frequentist
i.e. about the problemof interest,
approachconclusionscannotbe drawnabout parameters,
thevalidityofthemodelfunction
itself.
without
simultaneously
questioning
If we concludethatthe 'probability'
statements
used in drawingstatistical
conclucurrently
sions,whether
Bayesianor frequentist,
generallyhaveno reasonableprobabilistic
interpretation,
forscientific
at least in the
foundation
theycannotprovideus withan appropriate
inference,
context.We shallencounter
further
reasonsas we proceed.We are thenleftwiththe
exploratory
as we shallsee. Itprovidesthemeansof
likelihoodfunction
whichalwayshas a concretemeaning,
8
J.K. Lindsey
incorporating
priorknowledgeand information,
obtainingpoint estimatesand precisionof
modelcriticism
parameters,
comparing
models,includingerrorpropagation,
and prediction,
all
without
probability
statements,
exceptabouttheobserveddata(Lindsey,1996a).
The globalprobability
modelunderconsideration
represents
thatpartof presenttheoretical
whichis notin questionin thestudywhilepriorempiricalinformation
knowledge
can be incorfrompreviousstudies.Model selectionand precision
poratedby usingthelikelihoodfunctions
estimation
will involvelookingat ratiosof likelihoods(Fisher(1959), pages 68-75, and Sprott
andKalbfleisch
(1965)). Becausethelikelihoodfunction
ofmodels,fora
onlyallowscomparisons
givendata set, it is usuallynormedby dividingby its maximumvalue, so thatthisvalue is
notas a pointestimate,
important,
butas a reference
is used
point.Whenthelikelihoodfunction
in thiswayforinference,
thereis no conceptualdifference
betweencomparing
theprobability
of
thedataforvariousparameter
valueswithina givenmodelfunction
andcomparing
variousmodel
functions
(Lindsey,1974a). This providesa powerfulsense of coherenceto thisapproach.In
contrast,
exceptfortheAkaikeinformation
criterion
(AIC), no widelyacceptedmethodsfora
directcomparisonof non-nested
modelsare available,withineithertheBayesianor frequentist
frameworks.
3. 1. Example
The reporting
of AIDS is a one-timeaffair,
nota samplefromsomewell-defined
population,
so
anyfrequentist
samplespace can onlybe some arbitrary
The modelsused
conceptualconstruct.
andreporting
byDe AngelisandGilks(1994) assumethatAIDS detection
delaysareindependent
and thatthese two processescan jump arbitrarily
betweentimepoints.Their firstmodel is
to independence
inthecontingency
table:
equivalent
log(ij) = ai + f/j
(1)
where ij is themeannumber
ofreported
cases withi andj respectively
indexingdiscretequarters
and reporting
of reporting
overtime.
delays.It assumesthatthedistribution
delaysis stationary
Thepredicted
marginalcurveis plottedas thefullcurvein Fig. 1(a). Thismodelhas a devianceof
The Fisheriantestof significance
forindependence
has a P716.5 with413 degreesof freedom.
this
valueofzeroto theprecisionofmycomputer,
evidence
model.
This
providing
strong
against
to thatfroma frequentist
has
no
clear
run
probability,
althoughequivalent
test,
long
interpretation
inclassicalfrequentist
inference.
De Angelisand Gilks (1994) used reasonablynon-informative
normaland inverseGaussian
in a second independence
model. They did not describehow these
priorsforthe parameters
(personal?)priorswereelicited.Theyonlyconsideredthesetworathersimilarmodels,givingno
of anyotherpossibility,
so theirpriorprobability
priorsforthefirst,
includingnon-stationarity,
thetestjust performed,
musthave been 0. This contradicts
but,giventhispriorbelief,such a
problemcannotbe detectedfromwithintheBayesianparadigm.
4.
Likelihood principle
In supportofbasinginferences
on thelikelihoodfunction,
one customarily
invokesthelikelihood
of thisis 'All information
about 0 obtainablefroman experiment
is
principle.One statement
inthelikelihoodfunction
contained
for0 givenx. Twolikelihoodfunctions
for0 (fromthesameor
different
containthesameinformation
about0 iftheyareproportional
tooneanother.'
experiments)
about0'
(BergerandWolpert
(1988),p. 19; see also Birnbaum
(1962)). Thephrase'all information
mustbe takenin theverynarrowsenseof parameter
The
estimation,
giventhemodelfunction.
Statistical
Heresies
9
principalconsequenceis thatsuchinferences
shouldonlydependon theobserveddataandhence
noton thesamplespace.The mostcommonexampleis theidentity
ofthebinomialandnegative
themodelshaveverydifferent
binomiallikelihood
functions
although
samplespaces.
If thisprincipleis takenliterally,
an interpretation
of themodelunderconsideration
becomes
impossible.Few if anymodelshave meaningoutsidetheirsamplespace. An important
use of
modelsis to predictpossibleobservations,
whichis impossiblewithoutthesamplespace. Also,
an essentialpartof inference,
the interpretation
of modelparameters,
dependson the sample
ofthebinomialdistribution
is notthesameas thatofthe
space.Forexample,themeanparameter
negativebinomialdistribution.
is
However,themajorproblemwiththisprincipleis thatit assumesthatthemodelfunction
knownto be true.The onlyinference
problemis thento decideon an appropriate
set of values
fortheparameter,
i.e. a pointestimateand precision.It is simplynotcorrectthat'by likelihood
function
we generallymean a quantitythatis presumedto containall information,
giventhe
data, aboutthe problemat hand' (Bj0rnstad,1996). The likelihoodprincipleexcludesmodel
criticism.
In fact,applyingthelikelihoodprinciplealmostinvariably
in
involvesdiscarding
information,
abouttheparameter
of interest.
thewidersensethansimplyestimation,
Forexample,thebinomial
canbe performed,
andtheprobability
eveniftheonlyinformation
experiment
estimated,
provided
by a studyis the totalnumberof successes,giventhe totalnumberof trials(trialsmay be
In contrast,
the negativebinomialexperiment
simultaneous).
can onlybe performed,
and the
whenwe knowtheorderofthesuccessesandfailures,
probability
can onlybe estimated,
whichis
necessaryto knowwhento stop.This information,
aboutwhetheror not the parameter
being
is constant
overtime,is thrown
forthenegative
estimated
awaywhentheusuallikelihoodfunction
binomialmodel is constructed.
is
(For the binomialmodel, when the orderinginformation
likelihoodfunctionfora negative
available,it obviouslyalso mustbe used.) An appropriate
binomialexperiment
mustuse thisinformation
andwillnotbe identicalwiththatfortheminimal
binomialexperiment.
Bayesianswhowishto criticizemodelsmusteitherresortto frequentist
methods(Box, 1980;
in somewidermodel(Box andTiao, 1973).
Gelmanetal., 1996) orembedtheirmodelof interest
to be trulyBayesian,itmust,however,
be specified
before
Forthelatter,
calledmodelelaboration,
is excluded.
lookingatthedata.Again,theunexpected
The appropriate
likelihood
Thus,thelikelihoodprincipleis hollow,ofno realuse in inference.
functionforan experiment
musttake into accountthe sample space. Basically different
exare performed
suchas thebinomialand negativebinomialexperiments,
fordifferent
periments,
so the corresponding
reasonsand can alwaysyield different
information
likelihoodsmustbe
detailanddiscussion,
see Lindsey(1997a).)
different.
(Forfurther
4. 1. Example
The likelihoodprinciple,appliedto the stationary
model above,allows us to make inferential
in thatmodel,assumingthatit is correct.But itprovidesno way
statements
abouttheparameters
of criticizing
thatmodel,in contrast
withtheFisheriantestused above,norof searchingfora
better
alternative.
5.
Likelihood function
Let us now look more closely at the definition
of likelihood.The likelihoodfunctionwas
defined(Fisher,1922),and is stillusuallyinterpreted,
to givethe(relative)probability
originally
10
J.K. Lindsey
of theobserveddataas a function
of thevariousmodelsunderconsideration.
Models thatmake
thedatamoreprobablearesaidtobe morelikely.
However,in almostall mathematical
statistics
texts,thelikelihoodfunction
is directly
defined
as beingproportional
totheassumedprobability
oftheobservations
density
y,
L(O; y) ocf(y; 0),
(2)
wheretheproportionality
constantdoes notinvolvetheunknownparameters
0. For continuous
likelihoodas therelativeprobability
of the
variables,thisis simplywrongifwe wishto interpret
of anypointvalue of a continuous
observeddata.The probability
variableis 0. Thus,equation
(2) does notconform
to theoriginaldefinition.
This changeof definition
has had ramifications
thetheoryof statistical
Forexample,at themostbenignlevel,it has led to
throughout
inference.
manyparadoxes.
One signof themathematization
of statistics,
and of its distancefromappliedstatistics,
is
thatinference
procedures,
fromtheempiricalto thegeneral,do notallow fortheelementary
fact
thatall observations
are measuredto finiteprecision.I am not speakingaboutrandommeasurementerrors,but about the fixedfinite,in principleknown,precisionor resolutionof any
If thisis denotedby A, thenthecorrectlikelihoodforindependent
instrument.
observations
of a
continuous
variable,accordingto Fisher's(1922) originaldefinition
(he was verycarefulabout
it),is
J
yi+Ai/2
L(O; y) -H
f(ui; O)dui.
(3)
yi-Ai/2
In virtually
all cases wherethelikelihoodfunction
is accusedof beingunsuitableforinference,
includingby Bayesianswho claimthata priorwill correcttheproblem,thereasonis themore
recentalternative
definition.
(The fewothersarisefroman incorrect
specification
ofthemodelor
datato studya model;see Lindsey(1996a),pages 118-120 and 139.)
inappropriate
A classicexampleofthesupposedproblemswiththelikelihoodfunction
is thatit(actuallythe
distribution
density)can takeinfinite
values,as forthethree-parameter
log-normal
(Hill, 1963).
Definedas in equation(3), thelikelihoodcan neverbecomeinfinite.
Indeed,anydatatransformationthatdoesnottakeintoaccountAi can lead totrouble.
Let us be clear;I amnotarguingagainstmodelsusingcontinuous
variables.Theseareessential
inmanyapplications.
withhowto drawcorrectinferences
Theproblemis rather
aboutthem,given
theavailableobservabledata.Nor am I suggesting
thatequation(2) shouldneverbe used; it has
beena valuablenumerical
of equation(3), although
approximation,
giventheanalyticcomplexity
itis rarelyneededanymore,withmoderncomputing
power.Thus,whenproblemsarise,andeven
iftheydo notarise,we shouldneverforget
thatthisisjustan approximation.
Of course,thisconclusionis ratherembarrassing
because suchbasic toolsas theexponential
fallbythewaysideforcontinuous
statistics
familyand sufficient
variables,exceptas approximations(see thediscussionin Heitjan(1989)). The sufficient
statistics
containonlyan approximation
in the data about a parameterin real observationalconditions.Bayesian
to the information
procedures
can also haveproblems:ifseveralindistinguishable
valuesarerecorded(whichis impossiblefora 'continuous'
variable),theposterior
maynotbe defined.
5. 1. Example
The detectionof AIDS cases is a processoccurringover continuoustime.Nevertheless,
for
it is discretized
intoconvenient
administrative
reporting,
intervals,
quartersin ourcase. Because
Statistical
Heresies
11
in
ofthehighrateofoccurrence
ofevents,thisdoesnotcreateproblemsin fitting
models,whether
discreteor continuous
time,forthequestionof interest.
The samesituation
also arisesin survival
studies,buttheretheeventrateis muchlowerso timeintervals
maybe ofmoreconcern.
6.
Normal distribution
Ifall observations
areempirically
discrete,
we mustre-evaluate
theroleofthenormaldistribution.
Thatthisdistribution
has longbeen centralbothin modellingand in inference
is primarily
the
faultofFisherinhis conflict
withPearson.Pearsonwas greatly
withdevelopingapproconcerned
In
priatemodelsto describeobservedreality,
buthe used crudemomentmethodsof estimation.
Fisherconcentrated,
noton finding
themostappropriate
contrast,
models,buton thebiasedness
(fromthedesign,notthefrequentist
bias discussedbelow)andefficiency
of statistical
procedures.
Fornecessarypracticalreasonsof tractability,
thatsomebelieveto continueto hold to thisday,
the normaldistribution
was centralto developinganalysesof randomizedexperiments
and,
moregenerally,
to efficient
inferences
(althoughtheexponential
and location-scalefamiliesalso
emergedfromhiswork,almostinpassing).
in appliedstatistics
Thus,a keyfactorin thecontinueddominanceof thenormaldistribution
has beenthistractability
forcomputations.
Thiswas overthrown,
in themodellingcontext,
bythe
introduction
of survivaland generalizedlinearmodels. Developmentsin computer-intensive
in inference.
methodsare leadingto thesame sortof revolution
Analytically
complexlikelihood
functions
can nowbe explorednumerically,
althoughthishas so farbeen dominated
by Bayesian
applications
(see,however,
Thompson(1994)).
Here,I proposethat,just as thenormaldistribution
is nottheprimary
basis,althoughstillan
it shouldnotbe thefoundation
important
one, of modelconstruction,
of inference
procedures.
it shouldbe thePoissondistribution.
Instead,I suggestthat,ifone distribution
is to predominate,
A Poissonprocessis themodelofpurerandomness
fordiscreteevents.Thiscan be usedto profit
as a generalframework
withinwhichmodelsareconstructed
andforthesubsequent
inferences.
All observations
are discrete,whateverthe supposedunderlying
data generating
mechanism
andaccompanying
model.Thus,all observations,
areevents.Current
whether
positiveornegative,
modellingpracticesfor continuousvariablesdo not allow forthis. The Poisson distribution
that
providesan interface
betweenthemodelforthedata generating
mechanismand inferences
takesintoaccounttheresolution
ofthemeasuring
instrument.
A model based on any distribution,
even the Poisson distribution,
for a data generating
can be fitted,
mechanism
to theprecisionof theobservations,
by suchan approachbased on the
Poissondistribution
forthemeasurement
modelwhen
process.Thisdistribution
yieldsa log-linear
themodelunderstudyis a memberoftheexponential
thelikelihoodbythe
family(approximating
forcontinuous
stochastic
density
variables)andwhenitis a multiplicative
intensity
process(using
a similarapproximation
forcontinuous
time),perhapsindicating
whythesefamiliesare so widely
used(LindseyandMersch,1992;Lindsey,1995a,b).
A Poissonprocessdescribesrandomevents.One basic purposeof a modelis to separatethe
of interest,
thesignal,fromtherandomnoise.Supposethatthecombination
of
systematic
pattern
a distribution,
whichever
is chosen,and a systematic
regression
partcan be developedin sucha
fortheoccurrence
of
waythatthemodeladequatelydescribesthemeanof a Poissondistribution
theeventsactuallyobserved.Then,we knowthatthemaximuminformation
aboutthepattern
of
interest
has beenextracted,
leavingonlyrandomnoise.
A principle
thathas notyetfailedme is thatanymodelorinference
is untrustworthy
procedure
if it dependson theuniqueproperties
of linearity
i.e. if it cannot
and/orthenormaldistribution,
be generalized
is centralto
directly
beyondthem.IfthepointofviewthatthePoissondistribution
12
J.K. Lindsey
statistical
inference
is adopted,whatwe thenrequireare goodness-of-fit
procedures
tojudge how
in thedatathatremains
farfromtherandomness
ofa Poissonprocessis thatpartofthevariability
unexplained
bya model.
6. 1. Example
As is well known,thePoissondistribution
formsthebasis forestimating
log-linearmodelsfor
tables.Thus,I implicitly
used thisin fitting
contingency
my stationary
model above and shall
againforthosethatfollow.However,as in mostcases, thisprocedurehas anotherinterpretation
(Lindsey,1995a). In thiscase, I am fitting
risk(hazardor intensity)
functions
fora pairof nonhomogeneous
(i.e. withrateschangingovertime)Poissonprocessesforincidenceand reporting
delays.Above,theywerestepfunctions
describedbyai and/gj.
Below,I shallconsidermodelsthatallow bothAIDS detectionand reporting
to varycontinestimated
as Poisson
uouslyovertimeinsteadoftakingdiscrete
jumpsbutthatare,nevertheless,
processes.In thecontextof such stochasticprocesses,a suitablemodelwill be a modelthatis
Poissonat anydiscretely
observedpointin time,giventheprevioushistoryof the
conditionally
process(Lindsey,1995b). Thatused below is a generalization
of a (continuoustime)bivariate
Weibulldistribution
forthetimesbetweenevents(AIDS detection
andreporting).
7.
Model selection by testing
One fundamental
questionremainsopen:howcan ratiosoflikelihoods
be comparedformodelsof
different
complexity,
i.e. whendifferent
numbers
ofparameters
areinvolved?How can likelihood
intheprobability
ofhypothesis
ratiosbe calibrated?
Thenumberofdegreesoffreedom
statements
this calibrationof the likelihood
testshas traditionally
fulfilled
this role. Beforeconsidering
moreclosely.Nester(1996) has
function
bysignificance
levels,letus first
lookat stepwisetesting
describedsomeoftheproblems
withhypothesis
admirably
testing.
Everyoneagreesthat,theoretically
fora testto be valid,thehypothesis
mustbe formulated
thedata.No one doesthis.(I exaggerate;
beforeinspecting
protocolsofphaseIII trialsdo,butthat
is nota modelselectionproblem.)In theirinference
course,students
aretaughtthisbasicprinciple
for
of hypothesis
testing;in theirregression
course,theyare taughtto use testingsequentially,
stepwiseselection.
testto the
Considerthecase wheretwostatisticians,
independently,
applythesamefrequentist
it in advance,the otherdoes not.Bothwill obtainthe same value.
same data. One formulates
at his or
However,accordingto standard
inference
theory,
onlythefirstcan rejectthehypothesis
herstatedprobability
theconclusionby collectingfurther
level.The secondmustconfirm
data,
and indeedthe level of uncertainty
reportedabout a givenmodel shouldreflectwhatprior
followformal
workwas carriedout withthosedata. However,if two statisticians
exploratory
to arriveat thesamefinalteststarting
fromthesedifferent
initialconditions,
Bayesianprocedures
thereasonmustbe theirdiffering
initialpriors.
are
suchas theX2-distribution,
Testsat a givenprobability
level,based on some distribution
level
evenifthatprobability
assumedtobe comparableinthesamewayin anysituation,
generally
has no meaningin itself,forexamplebecausewe havepreviously
lookedat thedata.However,in
in unmeasurable
a sequenceof tests,thedegreeof uncertainty
is accumulating
ways,becauseof
on thesame data (evenlookingat thedatabeforea firsttestwill
previousselectionsperformed
in different
levels
so heretheseriesof significance
modify
uncertainty
waysin different
contexts),
are certainly
not comparable.The probability
levels obtainedfromsuchtestsare meaningless,
evenforinternal
comparisons.
Statistical
Heresies
13
Bayesianmethodshave theirownproblemswithmodelselection,besidesthefactthatone's
mustbe alteredas soon as one looksat thedata.(Whatare theallowpersonalpriordistribution
able effectsof data cleaningand of studying
on one'spersonalprior?)With
descriptive
statistics
a priorprobability
of theparameter
densityon a continuous
parameter,
theprobability
being0,
andhencebeingeliminated
fromthemodel,is 0. SomeBayesianshavearguedthatthequestionis
ratherif theparameter
is sufficiently
close to 0 in practicalterms(see, forexample,Bergerand
thishave been proposed.For
Sellke (1987)) butno widelyacceptedmeansforoperationalizing
invariant?
example,whatsuitableintervals
aroundthiszero are parameterization
Placinga point
priorprobability
masson sucha model,as withBayesfactors,
thesituation,
aggravates
leadingto
Lindley's(1957) paradox:withincreasinginformation,
the modelwithpointpriorprobability
necessarily
dominates
thosehavingcontinuous
priorprobability.
7. 1. Example
A modelfordependencebetweenAIDS detectionand reporting
delaysis equivalentto a nonstationary
stochastic
processforreporting
delays:theshapeof thedelaycurveis changingover
time.Because of themissingtriangleof observations,
a saturated
modelcannotusefully
be fitted
to thiscontingency
table.Instead,considera slightly
simplerinteraction
modelthatis stillnonparametric
(Lindsey,1996b)
log(,uij)=ai
+ ,Bj+ bit + yju
(4)
wheret and u respectively
are thequarterand thedelaytimes.This yieldsa devianceof 585.4
with364 degreesof freedom.The curveis plottedas thebrokencurvein Fig. l(a). The deviance is improved
The P-valueof thecorresponding
testis
by 131.1with49 degreesof freedom.
2.0 X 10-9,apparently
strong
evidenceagainststationarity.
Here,I have decidedon the necessityto findsuch a model afterseeingthepoor fitof the
modelused by De Angelisand Gilks(1994). Whatthenis thelongrunfrequentist
stationary
(as
thefactthatwe do not
ofthissecondP-value,evenignoring
opposedto frequency)
interpretation
havea sampleandthatthestudycan neverbe repeatedinthesameconditions?
As a Bayesian,howcouldI handlesucha modelifI had foreseenitsneedbeforeobtaining
the
data?It contains101 interdependent
parameters
forwhicha personalmultivariate
priordistributionwouldhavehad to be elicited.ShouldI workwithmultiplicative
factorsfortheriskfunction
or additivetermsforthelog-risk?
The twocan givequitedifferent
intervals
ofhighest
credibility
How shouldI weightthe priorprobabilities
of the two
posteriordensityforthe parameters.
models,withand withoutstationarity?
How shouldI studythe possibleelimination
of the 49
of theirbeing0 would
dependence(interaction)
parameters,
giventhattheposterior
probability
alwaysbe 0 (unlesstheywereexcludedin advance)?
8.
Stepwise and simultaneous testing
I now return
to thecalibration
problem.A changein deviance(minustwicethelog-likelihood)
betweentwo models of, say,fivehas a different
meaningdependingon whetherone or 10
are fixedin the second model in relationto the first.It has been argued,and is
parameters
such as the %2-distribution,
commonlyassumed,thata probability
distribution,
appropriately
calibrates
thelikelihoodfunction
to allowcomparisons
numbers
ofdegreesof
undersuchdifferent
freedomeven thoughthe P-value has no probabilistic
forthe reasonsdiscussed
interpretation
above.Thisassumption
is notoftenmadeclearto students
andclients.
Let us considermorecloselya simpleexampleinvolvingdifferent
numbersof degreesof
14
J.K. Lindsey
by testing,
applicable
freedom,
thatcomparingstepwiseand simultaneous
parameter
elimination
credibility
or conto bothfrequentist
and Bayesianprobability
statements.
Noticethatexamining
test.Takea modelforthe
fidenceintervals
wouldyieldthesameconclusionsas thecorresponding
meanofa normaldistribution,
havingunitvariance,defined
by
-,
ai + fij
witheach of two explanatory
variablestakingvalues 1 and -1 in a balancedfactorialdesign.
Witha likelihoodratiotest(here equivalentto a Wald test),the square of the ratioof each
errorwill havea %2-distribution
and,
parameter
estimateto itsstandard
with1 degreeof freedom
with2 degreesof
because of orthogonality,
thesumof theirsquareswill have a %2-distribution
underthenullhypotheses
thattheappropriate
freedom
parameters
are 0. First,we are comparing
a modelwithone parameter
at a time0 withthatwithneither0; then,both0 withneither0.
priorsforthetwo
EquivalentBayesianprocedurescan be givenusingany desiredindependent
parameters.
or construct
separateprobability
Supposethatwe perform
separatetestsforthe parameters
forthem,and thatwe findthattheindividualteststatistics
havevalues3 and 3.5. These
intervals
indicatethatneitherparameter
is significantly
different
from0 at the5% level,or that0 lies in
each interval.Because of orthogonality,
one parametercan be eliminated,then the other,
on thefirst
a simultaneconditionally
being0, in eitherorder.However,if,instead,we performed
ous testforbothbeing0, or constructed
a simultaneous
regionat thesame level as
probability
wouldhavea value of 6.5 with2 degreesof freedom,
indicating,
againat the
above,thestatistic
arisesin practice,it is
5% level,thatat leastone couldnotbe eliminated.
Whensucha situation
difficult
to explainto a client!
extremely
I call suchinference
and to
procedures
incompatible.
The criticism
appliesbothto frequentist
Bayesianprobability
statements.
Frequentist
multiple-testing
procedures,such as a Bonferroni
thatare so oftenincorrectly
worse.The
onlymakematters
correction,
appliedin sucha context,
two tests for individualeliminationwould be made at the 2.5% level, providingeven less
testwouldremain
indication
thateitherparameter
was different
from0, whereasthesimultaneous
unchanged.
Whatexactlyis theproblem?Calibrationof the likelihoodfunction
by fixinga probability
forvaryingnumbersof parameters
inducesan increasingdemand
level,Bayesianor frequentist,
as model complexity
on theprecisionof each parameter
increases.A givenglobal probability
level necessarilytightensthe precisionintervalaroundeach individualparameteras the total
is testedforelimination
numberofparameters
involvedincreases.Forexample,ifone parameter
or confidence
intervals
modelwill generally
at a time,or individualcredibility
used,theresulting
ora simultaneous
aretestedforremovalsimultaneously,
be muchsimplerthanifmanyparameters
arebiasedtowardscomplexmodels.
regionat thesamelevelused.Suchprocedures
mustbe heldconstant
Whatthenis thesolution?The precisionof each parameter
(at leaston
itcan easilybe demoninferences,
average),insteadoftheglobalprecision.To avoidincompatible
as proportional
to aP, where0 < a < 1 and
strated
thatthelikelihoodfunction
mustbe calibrated
or on average,
estimatedin themodel.(This is onlyapproximate,
p is thenumberof parameters
i.e. whenthelikelihoodfunction
doesnotfactor.)
arenotorthogonal,
whenparameters
In model selection,this constanta is the likelihoodratiothatwe considerto be on the
for
thatoneparameter
can be setto a fixedvalue(usually0). Moregenerally,
borderline
indicating
itis theheightofthenormedlikelihoodthatone choosesto definea likelihood
interval
estimation,
intervalforone parameter.
heightto definea simultaneous
Then,aP will be thecorresponding
thatis compatible
withthatforoneparameter.
regionforp parameters
In theprocessof simplifying
makethedataless probable(unlessthe
a model,we necessarily
Statistical
Heresies
15
parameterbeing removedis completelyredundant).This value a is the maximumrelative
of thedatathatwe are willingto acceptto simplify
a modelby removing
decreasedprobability
one parameter.
The smalleris a, thewiderare thelikelihoodregionsand thesimplerwill be the
model chosen,so thatthismaybe called the smoothing
constant.Well-known
model selection
criteria,such as theAIC (Akaike,1973) and the Bayes information
criterion
(BIC) (Schwarz,
1978),aredefined
inthisway.Suchmethodsdemystify
thewidelybelieved,andtaught,
frequentist
myth
thatmodelsmustbe nestedto be comparable.
How havesuchcontradictory
methodssurvivedso long?In manycases ofmodelselection,the
numberofdegreesoffreedom
is smallandaboutthesameforall tests,so thedifferences
between
model selectionconclusionsare not
thetwoprocedureswill thenbe minimal.Most important
borderline
cases so theresultswouldbe thesameby bothapproaches.In addition,mathematical
statisticians
have done an excellentjob of convincing
appliedworkersthattheoverallerrorrate
shouldbe heldconstant,
insteadofthemoreappropriate
individual
parameter
precision.Although
theformer
maybe necessaryin a decisiontestingsituation,
especiallywithmultipleendpoints,it
in a modelselectionprocess.
is notuseful,orrelevant,
I can treatone further
difficult
Aftermodelselectionhas beencompleted,
it
issueonlybriefly.
of interest.
This raisesthe
maybe desirableto make inferences
aboutone particular
parameter
in thepresenceof non-orthogonality
questionof how to factorlikelihoodfunctions
(Kalbfleisch
and Sprott(1970), Sprott(1975a, 1980) and Lindsey(1996a), chapter6). Parameter
transformationcan helpbutwe mayhaveto acceptthatthereis no simpleanswer.The Bayesiansolutionof
out the 'nuisance' parametersis not acceptable:a non-orthogonal
averagingby integrating
likelihoodfunction
is indicating
thattheplausiblevaluesoftheparameter
of interest
change(and
on thevaluesofthenuisanceparameters.
thushavedifferent
Undersuch
interpretation)
depending
an averageis uninterpretable
conditions,
(Lindsey(1996a),pages349-352).
8. 1. Example
modelexploration,
transformations
of timeand
Aftersome further
usingsome ratherarbitrary
I
a
of
of
have
found
a
continuous
timeparanumber
non-nested
involving
comparisons models,
metricmodelallowingfornon-stationarity
(Lindsey,1996b):
+ dI log(t)+ 42/t+
log([Uij)=0
+
01 log(u) +
02/u+ VIt/u+ v2tlog(u)
V3 log(t)/u+ V4 log(t)log(u).
thelog-transformation
oftimealoneyieldstheWeibullmodel.)Thismodelhas a deviance
(Fitting
of 773.8 with456 degreesof freedom.It has onlynineparameters
comparedwith101 in the
model(1). It is a simplification
ofthatmodelwitha deviance188.4
nonparametric
non-stationary
largerfor92 degreesoffreedomso theP-valueis 1.3 x 10-8,seemingto indicatea veryinferior
model.However,its smoothcurvethroughthe data,plottedas the dottedcurvein Fig. l(a),
appearsto indicatea good model both fordescribingthe evolutionof the epidemicand for
prediction.
Witha fairlylargeset of 6215 observations,
a = 0.22 mightbe chosenas a reasonablevalue
theintervalof precisionforone parameter;
fortheheightof thenormedlikelihooddetermining
thisimpliesthatthedeviancemustbe penalizedbyaddingthree(equal to -2 log(0.22))timesthe
numberof parameters.
(This a is smallerthanthatfromtheAIC: a = I/e = 0.37 so twicethe
wouldbe addedto thedeviance.It is largerthanthatfroma X2-test
at 5%
numberof parameters
a = 0.14, or adding3.84 timesthenumberofparameters,
with1 degreeof freedom,
but,withp
nonthisdoes not changeto 0.14p.) For the stationary
parameters,
model,the nonparametric
16
J.K Lindsey
modelandtheparametric
872.5,989.4
model,thepenalizeddeviancesarerespectively
stationary
forthe last. (Such comparisonsare onlyrelativeso
and 800.8, indicatinga strongpreference
wouldyieldthesame results.)This confirms
theconclusionsfromour
penalizedlog-likelihoods
intuitive
inspection
ofthegraph.
inno waythe'true'modelforthephenomenon
giventhearbitrary
Thisis certainly
understudy,
transformations
of time.However,it appearsto be quiteadequateforthequestionat hand,given
modelselectionexamples.)
theavailabledata.(See LindseyandJones(1998) forfurther
9.
Equivalence of inference procedures
give aboutthe
Some statisticians
claimthat,in practice,thecompeting
approachesto inference
One mayarguethat,just as
sameresultsso itis notimportant
whichis used ortaughtto students.
do a reasonably
procedures,
butthattheybothnevertheless
modelsareimperfect,
so areinference
withouttrying
to
goodjob. However,just as we wouldnotstickwithone modelin all contexts,
in all contexts,
norrefuse
to applyone modeof inference
improveon it,so we shouldnotattempt
are extremely
difficult
to checkempirically,
procedures
improvements.
Unlikemodels,inference
andthereis no agreement
on theappropriate
criteria
to do this.
it is truethattheconcluBayesianand frequentist
procedures,
Although,
formanycompeting
sionswill be similar,in ourgeneralexploratory
contextall methodsare notalike.Forsmalland
as describedabove,givevery
propermodelselectioncriteria,
largenumbers
ofdegreesoffreedom,
different
resultsfromthosebased on probabilities.
(The standardAIC is equivalentto a %2-test
orinterval
atthe5% levelwhenthereare7 degreesoffreedom.)
can reasonablycalibratethe
statements
For thosewho would stillmaintainthatprobability
considerthe
situations,
likelihoodfunction,
especiallygiventheirbehaviourin highdimensional
following
questions.
(a) The deviancecan providea measureof thedistancebetweena givenmodeland thedata
modelsforcontingency
%2-distribution.
tables)andhas an asymptotic
(e.g. withlog-linear
We wouldexpectmodelsthatare morecomplexto be generallycloserto thedata.Why
thendoestheprobability
ofa smalldeviance,indicating
a modelclose to thedata,become
increases?(Thiswas a questionaskedbya
verysmallas thenumberofdegreesoffreedom
sociologystudent.)
second-year
undergraduate
in multidimensional
ofteninvolvesintegration
spaces.
(b) The calculationof probabilities
unithypersphere
changeas p
How do the surfacearea and volumeof a p-dimensional
mathematics
increases?(Thisis a standard
undergraduate
question.)
statements.
formodelcomparison
inferences
without
probability
Herethenis mycompromise,
thefrequentist
likelihoodratioor Bayes
Take theratioof likelihoodsfortwomodelsof interest,
we haveno agreed
factor.
Maximizeeachoverunknown
parameters,
because,amongotherthings,
invariant
for
and integration,
withwhatever
measureon parameters
prior,is notparameterization
aboutprecision.To avoidincompatibility,
comintervalstatements
makingeasilycommunicable
to aP, with
bine each likelihoodwitha 'prior'thatpenalizesforcomplexity,
beingproportional
overwhich
a and p definedas in the previoussection.Here, p is the numberof parameters
was performed;
a shouldbe specifiedin theprotocolforthe studyand mustbe
maximization
withthesamplesize,as discussedbelow.Thisis the'non-informative,
objectiveprior'
congruent
for
The resulting
oddsthenprovidesuitablecriteria
foreachmodelunderconsideration.
posterior
witheach otherand
All conclusionsaboutmodelselectionwillbe compatible
modelcomparison.
inthefinalmodel.
likelihoodregionsforparameters
with(simultaneous)
or by Bayesiancoherency
In contrast
withevaluationby frequentist
long runcharacteristics
Statistical
Heresies
17
thisprocedure
will indicatethemostappropriate
amongconclusions,
setofmodelsforthedataat
hand(including
likelihoodfunctions
frompreviousstudies,whereavailableandrelevant).
Thisset
of models,amongthoseconsidered,
makestheobserveddata mostprobablegiventheimposed
constraint
on complexity.
Hence,we havea thirdcriterion
forinferential
evaluation.
Thisseemsto
me to providea betterguarantee
of scientific
generalizability
fromtheobserveddatathaneithera
longrunora personalposterior
probability.
of inference
Thus,we havea hierarchy
procedures:
modelfunction
and thenumberof parameters
(a) if theappropriate
requiredare notknown
beforedata collection,and one believesthatthereis no truemodel,use theabove procedureforsomefixedvalueofa (a generalization
oftheAIC) to discoverthesetofmodels
thatmostadequatelydescribesthedataforthequestionathand;
modelfunction
is notknownbeforedatacollectionbutone believesthat
(b) iftheappropriate
thereis a truemodel(fora Bayesian,it mustbe containedin the set of functions
to be
use theBIC totryto findit;
considered),
onehas a modelfunction
(c) if,beforedatacollection,
thatis knownto containthetruemodel,
use probability-based
Bayesian or frequentist
proceduresderivedfromthe likelihood
function
to decideonparameter
andtheiraccompanying
estimates
intervals
ofprecision.
The difference
betweenBayesianand frequentist
procedures,
each based on its interpretation
of
is less thanthatbetweeneitherof theseand model selectionmethods
inference
probabilities,
basedon appropriately
penalizedlikelihoods.
9. 1. Example
We saw,in theprevioussection,thatthepenalizeddevianceclearlyrejectsthenonparametric
nonmodelcomparedwiththeothertwomodels.However,we havealso seenthatFisherian
stationary
mannerin thismodelselectioncontext)strongly
tests(incorrectly
used in a frequentist
rejectboth
the stationary
and the parametric
models in favourof nonparametric
These
non-stationarity.
conclusionsare typicalof the differing
inferences
contradictory
producedthroughdirectlikelihoodandprobability-based
are involved.In such
approacheswhenlargenumbersofparameters
a penalizedlikelihood,
suchas theAIC, will selectsimplermodelsthanclassical
circumstances,
inferences
basedon probabilities,
whether
Bayesianorfrequentist.
10. Standarderrors
Hypothesis
testinghas recently
come underheavyattackin severalscientific
disciplinessuchas
psychologyand medicine;see, forexample,Gardnerand Altman(1989). In some scientific
whichare feltto be more
journals,P-valuesare beingbannedin favourof confidence
intervals,
thisis nottrue;P-valuesand confidence
informative.
(In thefrequentist
setting,
intervals,
based
on standarderrors,are equivalentif the pointestimateis available.) Thus, the formulation,
=5.1 (standarderror2.1), is replacingthe equivalent0 =5.1 (P= 0.015). Unfortunately,
believethattheonlywaythata statistician
is capableof providing
a measureof
manyscientists
to convince
estimateis through
itsstandard
error.It is oftenverydifficult
precisionofa parameter
evensomehighlytrainedappliedstatisticians
thatparameter
precisioncan be specifiedintermsof
the(log-) likelihood.Also, standardstatistical
packagesdo notsupplythelatterlikelihood-based
andthesecan be difficult
to extract.
intervals,
For non-normal
models,standarderrors,withthe sizes of samplesusuallyrequired,can be
itis oftenvery
thelikelihoodfunction,
extremely
misleading;at thesametime,without
inspecting
18
J.K. Lindsey
approximations
are
difficult
to detectthespecificcases,perhapsone timein 20, whenasymptotic
errorprovidesa quadraticapproximation
so faroffthatsucha problemhas occurred.The standard
to theprecisionobtaineddirectlyfromthelog-likelihood
function
fora givenparameterization
(Sprottand Kalbfleisch,
1969; Sprott,1973, 1975b).As is well known,it is notparameterization
invariant.
It is less widelyknownjust how muchtheprecisionintervaldependson theparamitcan be, especiallyiftheestimateis neartheedgeof
eterization
andhowpooran approximation
theory.
theparameter
space,whichis another
weaknessoffrequentist
asymptotic
whichparameters
maynotbe
Standard
errorsarea usefultoolin modelselectionforindicating
fordrawingfinalinference
necessary.Outsidelinearnormalmodels,theyare not trustworthy
conclusionsabouteitherselectionor parameter
theyare demandedby
precision.Nevertheless,
manyreferees
and editors,evenofreputable
statistical
journals.Iftheycometo replaceP-values,
intervalsof precision
they,in turn,will have to be replacedin a fewyearsby moreappropriate
basedon thelikelihoodfunction.
10.1. Example
standarderrorsprovidereasonableapproximations
Because of thelargenumberof observations,
a binarylogisticregression
fortheAIDS data.Instead,considera (real) case of fitting
involving
thesoftware
1068 observations
thathappensto be at hand.In a modelwithonlyfourparameters,
errorgivento be 9.46; thatwouldapparently
calculatesone ofthemtobe 5.97 withstandard
yield
of 0.40. However,whenthisparameter
an asymptotic
is removed,
thedevianceincreases
%2-value
thisis a pathological
by 3.06, a ratherdifferent
asymptotic
X2-value.Of course,fora frequentist,
is on theboundary
oftheparameter
with
examplebecausea parameter
space,as someonefamiliar
this
logisticregression
mightguessfromthesize of thecalculatedstandarderror.Nevertheless,
is
can catchtheunwaryandmuchmoresubtlecases arepossiblewhenever
thelikelihoodfunction
skewed.
In contrast,
changesin likelihoodare notmisleading(exceptwhenjudged asymptotically!),
evenon theboundaryof theparameter
abouttherelative
space; theyare stillsimplystatements
models.Thus,thevalueof3.06 is thepropervalue
oftheobserveddataunderdifferent
probability
to use above(theother,0.40, notevenbeingcorrectly
calculated).The normedprofilelikelihood
is plottedin Fig. 2, showingthatan AIC-basedinterval(a =1 /e) excludes0.
forthisparameter
between
HandandCrowder(1996),pages 101 and 113,providedotherexamplesofcontradictions
from
likelihood
ratios
and
from
standard
errors.
inferences
11. Sample size
from
of designinga studyand drawinginferences
Let us nowconsidertheopposingphilosophies
is premisedon theidea thatthelargerthesample
it.The theoryofalmostall statistical
procedures
we shall knowthe correctmodel exactly.Appliedstatisthebetter;withenoughobservations,
incorrect.
Theywantthe smallest
ticians,whenplanninga study,knowthatthisis completely
to answerthequestion(s)underconsamplepossiblethatwill supplytherequiredinformation
sideration
at theleastcosts.The onlythingthata largesampledoes is to allow thedetectionof
whilewastingresources.
effects
thatareofno practicalimportance,
Samplesize calculationsare criticalbeforeundertaking
anymajorstudy.However,thesesame
who carefullycalculatea sample size in theirprotocoloftenseem to forgetthis
statisticians
important
point,that samples shouldbe as small as possible,when theyproceedto apply
to thedatathattheyhavecollected.
inference
asymptotic
procedures
controlled
thecomplexity
Exceptperhapsin certaintightly
physicalandchemicalexperiments,
StatisticalHeresies
19
0
0
/
0
E
o -
------------------------
z o (DI
C-J
-2
0
4
2
6
p
Fig. 2. Normedprofilelikelihood(
interval
of a = 1/e (-)---- )
) forthe logisticregressionparameterwiththe standardAIC likelihood
of themodelrequiredadequatelyto describea data generating
mechanismincreaseswiththe
samplesize. Considerfirstthecase whereonlya responsevariableis observed.For samplesof
reasonablesize,one ormoresimpleparametric
probability
distributions
can usuallybe foundto fit
well.But,as thesamplesize grows,morecomplicatedfunctions
will be requiredto accountfor
the more complexformbeing revealed.The marginalization
hypothesisproves increasingly
inadequate.Thus,the secondpossibilityis to turnto subgroupanalysisthrough
regression-type
models.The moreobservations,
themoreexplanatory
variablescan be introduced,
and themore
At each step,themarginalization
mustbe estimated.
thefoundation
ofthe
parameters
hypothesis,
model,is modified.
Thus, this growingcomplexitywill generallymake substantively
uninteresting
differences
detectablewhile oftenhidingthe verypatternsthata studysets out to discover
statistically
(althoughit may,at thesame timeas makingtheoriginalquestionsimpossibleto answer,raise
new questions).Social scientists
withcontingency
tablescontaining
important
largefrequencies
are well awareof thisproblemwhentheyask statisticians
how to adjustthe%2-values,
and the
resulting
P-values,to givesensibleanswers.The solution,whenthesamplesize,forsomeunconin a propermodel
trollablereason,is too large,is to use a smallera, the smoothing
constant,
selectionprocedure,
as describedabove.
The detectionof practically
usefulparameter
differences
is directly
relatedto the smoothing
whichfixesthelikelihoodratioandthetwotogether
determine
thesamplesize. Withthe
constant,
difference
of interest
parameter
specifiedby the scientific
questionunderstudy,the smoothing
constant
andsamplesize shouldbe fixedintheprotocolto havevaluesinmutualagreement.
Fora
effectof interest,
givensize of thesubstantive
largersampleswill requiremoresmoothing,
i.e. a
This is the directlikesmallerlikelihoodratiocomparingmodelswithand withoutdifference.
lihoodequivalent
offrequentist
powercalculations(Lindsey,1997b).
inference
Asymptotic
procedures,
assumingthatthe samplesize goes to oo, are misplaced.
Theironlyrolecan be in supplying
to thelikelihoodfunction,
certainnumericalapproximations
or itsfactorization,
whenthesecan be shownto be reasonableforsufficiently
smallsamplesizes.
to conform
moreclosely
Asymptotic
frequentist
tests,iftheyare to be used,shouldbe corrected
to the(log-) likelihoodfunction,
whichshouldneverbe corrected
to conform
to a test.Bayesians
20
J.K. Lindsey
in thelikelihooddominating
cannottakecomfort
theirpriorasymptotically
(unlesstheyhavethe
truemodelfunction).
Unnecessarily
largesamplesare bad. The demandsby appliedstatisticians
foraccuratesamplesize calculationsand forexact small sampleinference
proceduresconfirm
this.
11.1. Example
ConsidernowtheAIDS datafortheUSA forwhichthereare 132 170 cases,comparedwithonly
6215 forEnglandand Wales. Withso manyobservations,
the mostminutevariationswill be
detectedbyusingstandard
inference
The samethreemodels,as estimated
fromthese
procedures.
data,are plottedin Fig. 1(b) (Hay and Wolak (1994) used the same nonparametric
stationary
modelas De Angelisand Gilks (1994), shownby thefullcurve).Again,theparametric
model
appearsto be mostsuitablefortheprediction
problemat hand.The nonparametric
non-stationary
model,themostcomplexof thethree,fluctuates
wildlyneartheend of theobservation
period.
in deviancebetweenthisandthestationary
However,
thedifference
modelis 2670 with47 degrees
in favourofthisunstablemodel.The parametric
offreedom,
modelis
clearlyrejecting
stationarity
evenmoreclearlyrejected:5081 with88 degreesoffreedom.
Withthepowerprovidedby thismanyobservations,
a
greatdetailcan be detected,
requiring
to the questionat hand. Hence, I would have to penalize the
complexmodel,but irrelevant
likelihoodverystrongly
to obtaina simplemodel.Fortheparametric
modelto be chosenoverthe
othertwo,thedeviancewouldhaveto be penalizedbyadding60 timesthenumberofparameters
tothedeviances,corresponding
to a = 10-13.
12.
Consistency
The mosteffective
fromusinga modelor methodis to
wayto discouragean appliedstatistician
inconsistent
Thisis completely
irrelevant
for
estimates.
saythatitgivesasymptotically
parameter
a fixedsmallsample;theinterval
ofplausiblevalues,nota pointestimate,
is essential.Fora point
themorepertinent
Fisherconsistency
is applicableto finitepopulations.If thesample
estimate,
so thequestionofa paramwerelarger,in a properly
themodelcouldbe different,
plannedstudy,
to a 'true'valuedoesnotarise.
eterestimatein somefixedmodelconverging
asymptotically
even a mean,has littleor no valid interpretation
outsidethe specific
However,a parameter,
modelfromwhichitarises.Forcountsofevents,themeanofa Poissondistribution
has a basically
fromthe interpretation
of the mean of its generalization
to a negative
different
interpretation
binomialdistribution;
a givenregressioncoefficient
changesmeaningdependingon the other
leads
variablesin themodel(unlesstheyare orthogonal).
explanatory
Ignoringthesedifferences
in
to contradictions
andparadoxes.The choiceofa newcarbymeanautomobilefuelconsumption
milespergalloncan be contradicted
bythatmeasuredin litresper 100 kilometres
(Hand,1994) if
thedistributions
towhichthemeansreferarenotspecified.
Thus,ifthemodelis heldfixedas thesamplesize grows,it will notadequatelydescribethe
of interest
will generally
observations.
But,if it is allowedto becomemorecomplex,parameters
be estimated
will changeinterpretation.
Information
inconsistently,
and,muchmoreimportantly,
cannotbe accumulated
abouta parameter.
asymptotically
The essentialquestion,in a properly
themodeldescribeswell the
designedstudy,is whether
observeddatain a waythatis appropriate
to thequestionbeingasked.Forexample,a fixedeffects
modelcanbe moreinformative
thanthecorresponding
randomeffects
theformer
model,although
suchas the
estimates.
The fixedeffects
modelallowscheckingof assumptions
yieldsinconsistent
formof thedistribution
of effectsand thepresenceof interaction
betweenindividualsand treat-
Heresies
Statistical
21
ments.The relativefitof thetwoto thegivendata can be comparedby appropriate
directlikelihoodmodelselectioncriteria.
Ironically,
virtually
all thecommonly
used estimates
forparameters
in modelswithcontinuous
responsevariablesare inconsistent
foranyreal databecausetheyare based on theapproximate
likelihoodof equation(2). Theseincludealmostall thecommon,in factapproximate,
maximum
likelihoodestimates.
The existenceof sucha lack of consistency
has been knownfor100 years.
Sheppard's(1898) correction
providestheorderoftheinconsistency
forthevarianceof a normal
Fisherwas carefulaboutthis,it has been conveniently
in recent
distribution.
Although
forgotten
work.
asymptotic
Frequentists
havealso beenmuchconcernedwithstatistically
unbiasedestimates(as opposed
to designbiases thatare usuallyfarmoreimportant).
The criticisms
of thisemphasisare well
known.For example,the unbiasedestimateof the varianceis no longerunbiasedwhentransformed
to a standard
thevalue mostoftenrequired.Interestingly,
unbiasedness
deviation,
adjusts
the maximumlikelihoodestimateof the normalvarianceso thatit becomes larger,whereas
forconsistency
makesit smaller.Fora fixedmodel,thebiasednessofmost
Sheppard'scorrection
estimates
theinconsistency
due to theapproxidisappearswithincreasedsamplesize; in contrast,
mationof equation(2) increasesin relativeimportance,
becauseoftheincreasedprecisionof the
estimate,
as do designbiases.
12.1. Example
In contrast
withmorestandardcontingency
tableanalyses,herethesize of thetablegrowswith
thenumberofobservations,
as timegoes by.It willalso increaseifthetimeinterval
is takento be
smallerthanquarters.
is notfixed,
Hence,forthenonparametric
models,thenumberofparameters
raisingquestionsofconsistency.
ForthedatafromtheUSA, thelargesamplesize wouldallowone to modelmuchmoredetail
if thatweredesired.For example,seasonalfluctuations
mightbe studied.However,by standard
inference
criteria,
onlythesaturated
modelfitswell,althoughitcannotprovidepredictions
forthe
andhencecannotanswerthequestionsof interest.
recentperiodwhenthereis themissingtriangle
13.
Nonparametric models
is dominated
Much of modernappliedstatistics
by thefearthatthemodelassumptions
maybe
of
thefactthatall suchassumptions
wrong,ignoring
necessarilyare wrong.This is a symptom
the lack of adequatemodel comparisonand model checkingtechniques,forlookingat model
This trendis fuelledby the fact,discussedabove, that
appropriateness, in currentstatistics.
values without
testsand confidenceintervalscannotbe used to examineparameter
frequentist
thestochastic
ofthemodelwhereasBayesianprocedures
simultaneously
questioning
assumptions
tobe correct.
mustassumetheoverallmodelframework
This fear has led to the widespreaduse of nonparametric
and semiparametric
models,
especiallyin medical applicationssuch as survivalanalysis.These, supposedly,make fewer
assumptions,
meaningfewerthatare wrong.However,thisis an illusionbecausetheassumption
thatwill
formof a modelis completely
unknownis just that:an assumption
thattheparametric
oftenitselfactuallybe wrong(Fisher(1966), pages 48-49, and Sprott(1978)). We studya
suchas theCauchy
in meansnonparametrically.
Do we knowthata stabledistribution,
difference
is notinvolved?
distribution,
formofthe
A nonparametric
modelis generally
understood
to be a modelwherethefunctional
has an infinite
numberof parameters;
fora
stochasticpartis not specifiedso it theoretically
22
J.K. Lindsey
frequentist,
thisprovideschallenginginference
problems.Paradoxically,
a model in whichthe
formof the systematic
functional
regression
partis not completelyspecified,so it potentially
containsan infinite
numberofparameters,
is notcallednonparametric;
thefrequentist
can handle
it routinely
(by conditioning)
because it does not directlyinfluenceprobabilistic
conclusions.
Thus,a classicalanalysisofvariancewitha factorvariablerepresenting
a continuous
explanatory
buta Cox (1972) proportional
variableis parametric
hazardsmodel,witha similarrepresentation
forthebase-linehazard,is semiparametric.
Fromthepointofviewofthelikelihoodfunction,
that
thereis no difference
onlytakesintoaccounttheparameters
actuallyestimated,
betweenthetwo.
One majoradvantageof inferences
based on parametric
modelsis thattheyallowall assumpIn contrast,
tionsto be checked,theavailabledatapermitting,
andto be improved.
nonparametric
modelsfollowthedatamoreclosely,leavinglittleorno information
formodelcheckingthatcould
indicateimprovements.
However,the two can be complementary:
a nonparametric
model can
serveas a basis forjudgingthegoodnessoffitofparametric
models.Ifnoneofthelatteris found
one can fallbackon usingtheformer
untilfurther
appropriate,
knowledge
has accumulated.
It is commonly
believedthatno meansexistfordirectly
comparing
parametric
and nonparametricmodels,evenforthesamedata.At best,diagnostics
can be compared.However,fromany
properly
specifiedstatistical
model,theprobability
of theobserveddata can be calculated;the
most'nonparametric'
model simplyhas equal probability
masses forall observations.
In other
models can be directlycompared
words,the likelihoodsof parametricand nonparametric
(Lindsey,1974a,b).
This,however,
leavesasidetheproblemofthevastlydifferent
numbers
ofparameters
estimated
in modelsof the two types.If the likelihoodfunction
is calibratedby means of a probability
anditsnumber
in sucha context,
distribution
ofdegreesoffreedom
(andthefrequentist
problems,
resolved),nonparametric
modelswillalmostinvariably
win.Suchcomparisons
usinga probability
calibrationgreatlyfavourcomplexmodels forthe reasonsdiscussedabove. In contrast,
with
properly
compatiblemodelselectionmethodsbased on al, parametric
modelswill generally
be
and oftensuperior,
competitive
dependingon theamountof scientific
knowledgethatis available
to developthemodel,as well as on thesize of a chosen,i.e. thedegreeof smoothing
required.A
is greater
model
be
in
the
its
likelihood
will judgedsuperior
parametric
if,
simplestcase, penalized
the
n
distinct
observathana"-'I/nn; thisis
likelihoodfor
penalizednonparametric
independent
with
at
of
information
that
models
tions
a pointmass each.Because theincreased
goodparametric
oftheirpossiblelackoffitcomparedwitha nonparametric
an indication
provide,including
model,
thecomparison
is alwaysworthmaking.
13.1. Example
the development
of my example,I have been directlycomparingparametric
and
Throughout
model
models.We haveseenhowtestingled to rejectionofthesimplerparametric
nonparametric
withthesame delayj for
forbothdata sets.In thesaturated
model,all nij observations
reported
thesame quarteri haveprobability
mass l /nij; combining
theseyieldsitsmaximizedlikelihood.
modelforthesedata.ForEnglandandWales,
Thus,thismodelis themostgeneralnonparametric
to as manyparameter
thereare 465 entriesin thetablecorresponding
estimatesin thesaturated
modelthathas zerodeviance.Thusthepenalizeddevianceis 1395,whichis muchpoorerthanfor
theothermodels,as seenfromthevaluesgivenabove.
14.
Robustness
A second answerto the fearof inappropriate
model assumptionhas been to look forrobust
StatisticalHeresies
23
and
inference
procedures.
Much of thissearchhas occurredin thecontextof pointestimation,
thenormaldistribution
decisionproblems,
thatarenotthesubjecthere.Thus,in manysituations,
(Box and Tiao
and the t-testare widelyconsideredto be robustundermoderatenon-normality
(1973),p. 152).
is thatparameterestimatesare not being undulyinfluenced
by
One symbolof robustness
are hauntedby thepossibility
of outliers.They
extremeobservations.
Some appliedstatisticians
to be valid,insteadof modifying
believethattheymustmodifyor eliminatethemforinferences
Outliersare
themodel(iftheobservations
arenotin error)or acceptingthemas rareoccurrences.
oftenthemostinformative
partof a data set,whetherby tellingus thatthe data werepoorly
mask
procedures
shouldnotautomatically
collectedorthatourmodelsare inadequate.Inference
oradjustforsuchpossibilities
butshouldhighlight
them.
andotherdiagnosticresults,are definedonlyin termsof somemodel.Unfortunately,
Outliers,
in thesamewayas fora testrejectinga nullhypothesis,
theyneverindicateone uniquewayof
theproblem.Possiblescientific
models
correcting
waysto proceedincludelookingat competing
or havingsome moreglobal theoretical
modelwithinwhichthe model underconsideration
is
in whichcases directlikelihoodmethodsare applicable.However,ad hoc patchingof
embedded,
thecurrent
modelis probably
morecommon.
Manystatisticians,
especiallyfrequentists
forreasonsdevelopedearlier,professa preference
forresultsthatare insensitive
thatare notdirectly
of interest.
In contrast,
I
to anyassumptions
claimthatbotha goodmodelanda good inference
shouldbe as sensitiveas possibleto
procedure
that
theassumptions
beingmade.Thus,a robustmodelwill be a modelthatcontainsparameters
allowsit to adaptto, and to explain,thesituation(Sprott,1982). This will thenprovideus with
givingindications
of meansbywhichthe
waysto detectinadequaciesin ourapproach,hopefully
can be further
improved.
However,
model,andourunderstanding
ofthephenomenon
understudy,
ifone'sattitude
is simplyto answertheresearchquestionat hand,thenone mayarguein favourof
to (apparently)
incorrectness.
modelsthatareinsensitive
irrelevant
14.1. Example
formoftheparametric
modeldevelopedforEnglandandWalesalso describesthe
The functional
valuesare very
datafromtheUSA reasonably
well,as seen fromFig. 1, althoughtheparameter
in thereporting-delay
process.A similar,although
different,
especiallybecauseofthedifferences
modelcan be developedspecifically
forthesedata whichfitsverymuch
functionally
different,
ofthe
better.
thereis littlevisibledifference
whenit is graphed,
showingtherobustness
However,
inrecently
modelsforthequestionathand,at leastforpredictions
pastquarters.
parametric
15.
Discussion
theprincipal
These comments
havebeen made in thecontextof (scientific)
modeldevelopment,
ofmostappliedstatisticians.
butnotless important,
activity
Manyarenotapplicablein thatrarer,
However,in spiteof itsname,presentconfirmaprocessofmodeltesting,
or in decision-making.
thata null hypothesis
can be rejected.This
toryinferenceis essentiallynegative:confirming
clinicaltrials,requiring
tensof thousands
perhapsexplainswhyconfirmatory
phaseIII industrial
of subjectswhiletestingone endpointas imposedbyregulatory
agencies,do notyieldscientific
information
thatis proportional
to theirsize and cost. Owingto theinadequacyof exploratory
in phasesI and II forspecifying
to be tested,if propergoodness-of-fit
inference
thehypothesis
in phaseIII, it couldbe very
criteria
wereeverappliedto themodelforthealternative
hypothesis
embarrassing.
24
J.K. Lindsey
in
Certainof thepracticesthatI have criticized,
such as theuse of the normaldistribution
constructing
modelsand in developingasymptotic
inference
criteria,
originally
arosethrough
the
practicalneedto overcomethelimitations
of thetechnology
of thetimes.Withtheadvancement
oftechnology,
especiallythatofcomputing
power,suchrestrictions
areno longernecessary.
Otherpracticeshave arisendirectly
through
the development
of mathematical
out
statistics,
of touchwiththe realitiesof scientific
researchand appliedstatistics.
Based on well-reasoned
arguments
and deep mathematical
proofs,theyare themoredifficult
to fight.Reinforcing
this,
manypeople firstencounterstatistics
throughintroductory
coursespresenting
littleotherthan
theseinference
principles,
taughtby mathematically
trainedinstructors
and containing
fewhints
Aftera valiantstruggle
aboutusefulmodelsor realapplications.
to masterthecontradictions,
they
as theonlyonespossible,andthatstatistics
cometo accepttheseprinciples
is a difficult
subjectof
littlepracticaluse,a necessaryevil.
Nevertheless,applied statisticsseems oftento be leading mathematicalstatisticsin the
forexample,how,increasingly
of inference
development
procedures.
Consider,
widely,non-nested
theAIC is used
deviancesare directly
comparedin theappliedstatistics
literature
or,preferably,
in place of tests,in spiteof its known'poor' asymptotic
formodelcomparisons,
for
properties
selectingthe'true'model.
We practisestatistics
without
everquestioning,
orevenseeing,thesebasiccontradictions.
Thus,
forexample,we tellstudents
andclients
areacceptable,
(a) thatall modelsarewrong,butthatonlyconsistent
estimates
(b) thatwe can calculatetheminimum
necessarysamplesize, butthatwe mustthendraw
inferences
as ifitshouldideallyhavebeeninfinite,
mustbe formulated
beforelookingat thedata,butthattestingis appro(c) thathypotheses
priateas a stepwiseselectionprocedure,
statements
abouttheirlimits,but we interpret
(d) thatconfidenceintervalsare probability
themas probabilities
abouttheparameters,
and
thenwe use Jeffreys
orreference
mustobeythelikelihoodprinciple,
(e) thatinferences
priors.
inferencestatements,
true models,the
Unnecessaryunderlying
principles-probability-based
model
likelihoodas probability
accumulationof information,
density,asymptotic
consistency,
effects.
nesting,
minimizing
modelassumptions-havewide-ranging
and oftenunnoticed
harmful
The roleofthestatistician
is to questionandto criticize;letus do so withourownfoundations.
Acknowledgements
of
at theSymposium
on theFoundations
This is a revisedversionof an invitedpaperpresented
in HonourofDavid Sprott(University
ofWaterloo,
October3rd-4th,1996).
Statistical
Inference
MarkBecker,Denis de Crombrugghe,
DavidFirth,
Dan Heitjan,David Hinkley,
Murray
Aitkin,
Colin James,PhilippeLambert,TomLouis,PeterMcCullagh,JohnNelder,FranzPalm,Stephen
Senn and JohnWhitehead,as well as fivereferees,includingDavid Draperand David Hand,
criticisms
andsuggestions.
providedvaluablecomments,
References
theoryand an extensionof themaximumlikelihoodprinciple.In Proc. 2nd Int.Symp.
Akaike,H. (1973) Information
Inference
Theory(eds B. N. PetrovandF. Csaki),pp.267-28 1. Budapest:AkademiaiKiado.
andtimeseries.J R. Statist.Soc. A, 125,
C. B. (1962) Likelihoodinference
Barnard,G. A., Jenkins,
G. M. and Winsten,
321-352.
StatisticalHeresies
25
Berger,J.0. and Sellke,T. (1987) Testinga pointnullhypothesis:
theirreconcilability
of P valuesand evidence.J Am.
Statist.Ass.,82, 112-139.
Berger,J.0. andWolpert,
R. L. (1988) TheLikelihoodPrinciple:a Review,Generalizations,
and StatisticalImplications.
Hayward:Institute
ofMathematical
Statistics.
New York:Wiley.
Bernardo,
J.M. andSmith,A. F M. (1994) BayesianTheory.
ofstatistical
J Am.Statist.
Bimbaum,A. (1962) On thefoundations
inference
Ass.,57, 269-306.
J.F (1996) On thegeneralization
ofthelikelihoodfunction
Bj0rnstad,
andthelikelihoodprinciple.
J Am.Statist.Ass.,91,
791-806.
J Am.Statist.Ass.,71, 791-799.
Box, G. E. P. (1976) Scienceandstatistics.
of scientific
in thestrategy
modelbuilding.In Robustness
in Statistics
(1979) Robustness
(eds R. L. Launerand G.
N. Wilkinson),
pp. 201-236. New York:AcademicPress.
in scientific
(1980) SamplingandBayes' inference
modellingandrobustness
(withdiscussion).J R. Statist.Soc. A,
143,383-430.
Box,G. E. P.andTiao,G. C. (1973) BayesianInference
inStatistical
Analysis.New York:Wiley.
modelsandlife-tables
Cox,D. R. (1972) Regression
(withdiscussion).J R. Statist.Soc. B, 34, 187-220.
De Angelis,D. and Gilks,W R. (1994) Estimating
acquiredimmunedeficiencysyndromeincidenceaccountingfor
reporting
delay.J R. Statist.Soc. A, 157,31-40.
Fisher,R. A. (1922) On themathematical
foundations
oftheoretical
statistics.
Phil. Trans.R. Soc. Lond.A, 222,309-368.
2ndedn.Edinburgh:
OliverandBoyd.
(1959) Statistical
Methodsand Scientific
Inference,
(1966) DesignofExperiments.
Edinburgh:
OliverandBoyd.
Gardner,
M. J.andAltman,D. G. (1989) Statistics
withConfidence.
London:BritishMedicalJournal.
Gelman,A., Meng,X. L. and Stern,H. (1996) Posterior
predictive
assessment
ofmodelfitness
via realizeddiscrepancies.
Statist.Sin.,6, 733-807.
statistical
Hand,D. J.(1994) Deconstructing
questions(withdiscussion).J R. Statist.Soc. A, 157,317-356.
Data Analysis.London:ChapmanandHall.
Hand,D. J.andCrowder,
M. (1996) PracticalLongitudinal
the unconditional
cumulativeincidencecurveand its
Hay,J.W and Wolak,F A. (1994) A procedureforestimating
forthehumanimmunodeficiency
virus.Appl.Statist.,
variability
43, 599-624.
Heitjan,D. F (1989) Inference
fromgroupedcontinuous
data:a review.Statist.Sci.,4, 164-183.
Hill,B. M. (1963) The three-parameter
lognormaldistribution
and Bayesiananalysisof a point-source
epidemic.J Am.
Statist.
Ass.,58, 72-84.
and A(,,)or Bayesiannonparametric
inference.
(1988) De Finetti's
theorem,
induction,
predictive
Bayes.Statist.,
3,
211-241.
J. D. and Sprott,D. A. (1970) Applicationof likelihoodmethodsto modelsinvolvinglargenumbersof
Kalbfleisch,
parameters
(withdiscussion).J R. Statist.Soc. B, 32, 175-208.
J Am.Statist.
A. E. (1995) Bayesfactors.
Kass,R. E. andRaftery,
Ass.,90, 773-795.
Kass, R. E. and Wasserman,L. (1996) The selectionof priordistributions
by formalrules.J Am. Statist.Ass., 91,
1343-1370.
D. V (1957) A statistical
Lindley,
paradox.Biometrika,
44, 187-192.
ofprobability
J R. Statist.Soc. B, 36, 38-47.
distributions.
Lindsey,
J.K. (1974a) Comparison
ofstatistical
models.J R. Statist.Soc. B, 36, 418-425.
andcomparison
(1974b) Construction
and CountData. Oxford:OxfordUniversity
Press.
(1995a) ModellingFrequency
models.Appl.Statist.,
(1995b)Fitting
parametric
counting
processesbyusinglog-linear
44, 201-212.
Statistical
Oxford:OxfordUniversity
Press.
(1996a) Parametric
Inference.
bivariateintensity
withan application
to modellingdelaysin reporting
(1996b) Fitting
functions,
acquiredimmune
J R. Statist.Soc. A, 159, 125-131.
deficiency
syndrome.
J Statist.PlanngInf:,59, 167-177.
(1997a) Stoppingrulesandthelikelihoodfunction.
forexponential
(1997b) Exactsamplesize calculations
familymodels.Statistician,
46, 231-237.
Lindsey,J.K. and Jones,B. (1998) Choosingamonggeneralizedlinearmodelsappliedto medicaldata.Statist.Med., 17,
59-68.
of marginalmodelsforrepeatedmeasurements
in clinical
Lindsey,J.K. and Lambert,P. (1998) On theappropriateness
trials.Statist.Med.,17,447-469.
andcomparing
withlog linearmodels.Comput.Statist.
Lindsey,J.K. andMersch,G. (1992) Fitting
probability
distribution
Data Anal.,13,373-384.
creed.Appl.Statist.,
Nester,M. R. (1996) An appliedstatistician's
45, 401-410.
ofa model.Ann.Statist.,
thedimension
Schwarz,G. (1978) Estimating
6, 461-464.
W F (1898) On thecalculationofthemostprobablevaluesoffrequency
constants
fordataarranged
to
Sheppard,
according
divisionsofa scale. Proc.Lond.Math.Soc., 29, 353-380.
equidistant
D. J.,Freedman,L. S. and Parmar,M. K. B. (1994) Bayesianapproachesto randomizedtrials(with
Spiegelhalter,
discussion).J R. Statist.Soc. A, 157,357-416.
andtheirrelationto largesampletheory
D. A. (1973) Normallikelihoods
ofestimation.
Sprott,
Biometrika,
60,457-465.
(1975a) Marginalandconditional
sufficiency.
Biometrika,
62, 599-605.
ofmaximum
likelihoodmethodsforfinite
(1975b) Application
samples.SankhyaB, 37, 259-270.
to normality.
Can. J
(1978) Robustnessand non-parametric
proceduresare nottheonlyor the safe alternatives
26
J. K. Lindsey
Psychol.,32, 180-185.
(1980) Maximumlikelihoodin smallsamples:estimation
in thepresenceof nuisanceparameters.
Biometrika,
67,
515-523.
andmaximum
(1982) Robustness
likelihoodestimation.
Communs
Statist.Theory
Meth.,11,2513-2529.
Sprott,
D. A. andKalbfleisch,
J.D. (1969) Examplesof likelihoodsandcomparison
withpointestimatesand largesample
J Am.Statist.
approximations.
Ass.,64, 468-484.
D. A. andKalbfleisch,
Sprott,
J.G. (1965) Theuse ofthelikelihoodfunction
in inference.
Psychol.Bull.,64, 15-22.
Thompson,
E. A. (1994) MonteCarlolikelihoodin geneticmapping.Statist.Sci.,9, 355-366.
L. A. (1989) A robustBayesianinterpretation
Wasserman,
oflikelihoodregions.Ann.Statist.,17, 1387-1393.
Discussion on the paper by Lindsey
DavidJ.Hand (TheOpenUniversity,
MiltonKeynes)
I wouldlike to ask forsome clarification
of how you see therelativerolesof modeland questionin data
it seems to me thatthe emphasisis oftenon themodel,whereasit shouldbe on
analysis.In particular,
thequestion.You argue,on p. 23, thatmodelsand inference
proceduresshouldbe as sensitiveas possible
to the assumptions,
while also sayingon thatpage thatone may argue in favourof models whichare
insensitive
to irrelevant
incorrectness.
Please can youclarifyyourpositionon this?
The trickof modelbuildingis to choose a modelwhichdoes notoverfit
thedata,in thesense thatthe
models do not fitthe idiosyncrasiesdue to chance events,as well as fitting
the underlying
structure.
Recentyears have witnessedthe developmentof manymethodsforthis. Statisticiansoftensolve the
problemby choosing froma restrictedfamilyof models linear or generalizedlinear models, for
measure. The neural networkpeople, buildingon the
example or use a penalized goodness-of-fit
extremeflexibility
of theirmodels,have developedstrategiesforsmoothingoverfitted
models,as have
people workingon ruleinductionand treemethodsin machinelearning.Fromthefieldof computational
learningtheorywe have methodswhich choose a model whichhas minimumvariancefroma set of
overfitted
models,and so on. Many of theseapproachesare ratherad hoc. One of thethingsthatI like
aboutthispaper is thatit gives a principledcriterionto choose betweenmodels. You suggestthatthe
criterion
of compatibility
is attractive,
and I agree.Based on this,you suggestcalibratingthelikelihood
functionin termsof aP. By choosinga smallera one increasesone's confidencethattheincludedeffects
are real. However,you also suggestchoosinga smallera so thatonlythoseeffectswhichare practically
relevantare includedin themodel.It seemsto me thatthisis mixingup twouses of thea-parameter:its
use as a gauge to determine
how confident
we wantto be thatan effectis realbeforewe agreeto include
it in themodel,and itsuse to forcethemodelto be simplerforpracticalreasons.The firstof theseis an
whereasthe second forcesit away fromthistruth.It is notclear
attemptto reflecttheunderlying
truth,
thatthisis thebestwayto achievethesecondobjective.Perhapsa betterwaywouldbe to buildthebest
modelthatyou can and thento simplifyit relativeto theresearchquestion.This also allows one to take
thesubstantive
of different
effectsintoaccount.
importance
On p. 20 you say 'a parameter,
even a mean,has littleor no valid interpretation
outsidethe specific
model fromwhichit arises'. And you illustrate
thiswithmyfuelconsumption
example,sayingthatit is
to whichthemean refersare specified.But thisis notthecase. The aspects
resolvedif thedistributions
of the model which matterhere are the empiricalrelationshipswhich the numberswere chosen to
of fuel consumptionin
represent.Takingaccountof the Jacobianin movingbetweenthe distribution
miles per gallon and itsreciprocaldoes notsolve theproblemof whichof thetwo competingempirical
systemswe wishto drawa conclusionabout.
You say thatthemodelyoupresentin Section8.1 'is certainlyin no waythe"true"model',butsurely
it is, in fact,fundamentally
misspecifiedand cannotpossiblybe legitimate.In themodel,u is thedelay
time.As suchit is a ratioscale and we mayexpressit in different
units.If we do thiswe see thatan extra
term,v5t is introducedinto the model. If I fita correctedmodel includinga termin t, I obtain a
penalizeddevianceof 803.8, greaterthanthe devianceof 800.8 forthe incorrectmodel thatyou give,
whichis presumablywhyyou rejectedthe former.However,if I use (fractionsof) yearsforthe delay
time(approximated
by dividingthe giventimesby 4), I obtaina penalizeddevianceforyourmodel of
worsethanthatof mycorrectedmodel,which,of course,remainsat 803.8.
825.9, whichis substantially
oftheunitsthatyou happened
The misspecification
of yourmodelmeansthatthedevianceis a function
to use, which is absurd.In view of the elegantcompatibleformalismthatyou have introducedfor
choice of model formhas led to 'incompatibility'
comparingmodels,it is ironicthatan unfortunate
acrossa choiceofunits!
Discussion on thePaper by Lindsey
27
Of course,this is a detail which is easily corrected,and it does not detractfromwhat is a very
interesting
andvaluablepaper.It givesme genuinepleasureto proposethevoteofthanks.
David Draper (University
ofBath)
Thereis muchto admireand agree within thispaper; in particularthe authordeservesour thanksfor
tacklingpracticalissues in what mightbe termedthe theoryof applied statistics.I especially like
Lindsey'sremarkson theirrelevenceof
and
(a) asymptotic
consistency
is notknownwithcertainty
(b) thelikelihoodprinciplewhenmodel structure
(i.e., in real problems,
always).
thatI will focus
However,thereis also muchwithwhichto disagree,and it is on thesedisagreements
here,ordering
mycommentsfrommildto severecriticismoftheauthor'spositions.
(a) In Section2 Lindseysays,
'Many will arguethatthese[pointsof view on theexistenceof a "true" model] are "as if" situations:conditionalstatements,
notto be takenliterally'.
In otherwords,he is sayingthatthisis mathematics,
by whichI mean deductivereasoning:if
I make assumptionA = {a truemodel exists}, thenconclusionC follows.But recalling(e.g.
de Finetti(1974, 1975)) thatall statisticalinferenceproblemsmayusefullybe cast predictively,
withtheingredients
(i)
(ii)
(iii)
(iv)
alreadyobserveddata y,
as yetunobserveddata y*,
contextC abouthow thedataweregatheredand
assumptionsA abouthow y and y* are related
it seems to me thatthe best thatwe can ever do (no matterwhat our religiousviews on
Bayes may be) is conditionalconclusions,e.g. of the formp(y* ly, C, A): assumption-free
is notpossible.Thusto me statisticalinferenceis inherently
inference
deductive,
just liketherest
of mathematics;
we cannotescape our assumptions.(Even Fisher,theorginal'likelihoodlum'
in thispaper-is vague on thispoint,e.g. Fisher
littleattention
whoseviews receivesurprisingly
(1955).) Withhis advocatedmethodsthe authorcertainlydrawsconclusionsbased on assumptions;so in whatsense is he reasoninginductively?
undercutsmany of his criticisms.For example,
(b) Lindsey'sfailureto address decision-making
modelselectionis bestregardedas a decisionproblem:as I and othershave notedelsewhere(e.g.
Bemardo and Smith(1994) and Draper (1998a)), to choose a model you have to say to what
purposethe model will be put,forhow else will you knowwhetheryourmodel is sufficiently
based on context)he would
good? If the authortookthisseriously(specifyingutilityfunctions
discoverthathe can dispensewiththead hoc natureofthechoice ofhis quantitya.
(c) Lindseyclaimsin Section8 that
of averagingby integrating
out the
'The Bayesiansolution[to the problemof marginalization]
"nuisance" parametersis notacceptable:a non-orthogonal
likelihoodfunctionis indicatingthat
theplausiblevalues of theparameterof interestchange(and thushave different
interpretation)
dependingon thevalues ofthenuisanceparameters'
(emphasisadded). This is arrantnonsense.Forinstance,in thelocationmodel
yi = ,u+ a ei
IID
ei -vtv,,
it is truethata and v are typicallycorrelatedin the posteriordistribution
(e.g. witha diffuse
prior),but the meaningof v as an index of tail weightdoes not change fromone O-value to
and theweightedaverage
another,
p(vly)
Jp(vIa,
y)p(a y) da
of conditionaldistributions
as a summaryof model uncerp(vla, y) is perfectlyinterpretable
28
Discussion on thePaper by Lindsey
taintlyabouttheneed forthet-distribution
insteadofthesimplerGaussiandistribution.
(d) The author'sBayesianis a strawman who overemphasizescoherence,relieson highestposterior
densityregions,has neverheardof cross-validation
and is well out of date on methodology
and
outlook.As a practisingappliedBayesianI do notrecognizeLindsey'scaricaturein myselfor in
(thevastmajorityof) mycolleagues,and I commendresourcessuchas Gatsoniset al. (1997) and
Bernardoet al. (1998) to theauthor'sattention
ifhe wishesto forma moreaccurateview.Several
intofocus.
quotesfromthepaperwill bringthisdisparity
'For me, the quintessential... Bayesian dependson methodsbeing coherentafterapplyingthe
Bayes formulato personalpriorselicitedbeforeobservingthedata.'
Whata narrow,old-fashioned
view of Bayes thisis! Personalpriorsand coherenceneed notbe
all; out-of-sample
predictivecalibration(see below) can help to tune the choice of priorand
modelstructure.
... regionsof highestposterior
'For continuousparameters,
densityare notpreservedundernonlineartransformations
whereasthosebased on tail areas are.'
The fullposterioris invariant;and nobodyforceshighestposteriordensitysummarieson you
is anotherdecisionproblem,forwhichtail areas may be an adequate
(posteriorsummarization
workingapproximate
solution).
'The argumentin favourof subjectiveBayesianprobabilitiesreliescruciallyon theprincipleof
coherence,[whichunfortunately]
has onlybeen shownto applyto finitely
additivesetsofmodels.'
Coherenceis overratedas a justificationforBayes (in the sense thatit is necessarybut not
in good appliedwork),and theauthoroveremphasizes
sufficient
it.The bestthatit can offeris an
assuranceof internalconsistencyin probability
judgments;to be usefulscientific(and decisionmaking)co-workers,
we mustaspireto more as well: to externalconsistency(calibration).For
me theargument
fortheBayesianparadigmis thatit works:itallows theacknowledgement
of an
widerset of uncertainties,
appropriately
e.g. about model structure
as well as about unknowns
and itmakesthecrucialtopicofpredictioncentralto themodelling.
conditionalon structure;
'For a priorprobability
distribution
to exist,thespace ... of all possiblemodelsto be considered,
mustbe completelydefined.Anymodelsthatare notincluded,or thathave zero priorprobability,
will always have zero posteriorprobabilityunderthe applicationof the Bayes formula.No
... [Thusunder]
can giveunforeseen
modelspositiveposteriorprobability.
empiricalinformation
formalBayesianprocedures,scientific
discoveryis impossible.'
The firstpartof thisis true,and well knownto Bayesiansas Cromwells rule (Lindley,1972).
withBayesian nonparametric
But, firstly,
methodsbased on Markovchain Monte Carlo computations(e.g. Walkeret al. (1999) and Draper et al. (1998)) we can now put positiveprior
probabilityon the space of all (scientifically
defensible)models in interesting
problemsof real
calibration(e.g. Gelfandet al.
complexity,
and, secondly,we can use out-of-sample-predictive
withoutusing
(1992) and Draper(1998b)) to learnabout appropriatepriorson model structure
take the stingout of Cromwell'srule in
the data twice. Both of these approacheseffectively
appliedproblems:we are freeto findout thatour initialset of model structural
possibilitieswas
notsufficiently
richand to expanditwithoutcheating.
To summarize,the authorhas clearlyshruggedoffthe shacklesof frequentist
inferencesuccessfully
of his belief in thevalue of the likelihoodfunctionhas
(if ever he was so shackled),and the strength
broughthim to the very doorstepof Bayes, so withthe same sortof evangelicalspiritwithwhich
Lindseyhereproselytizesus about likelihood I urgehim to walk boldlythroughthe Bayesian door:
withmoderncomputingand predictivevalidationto keep himcalibrated,I thinkthathe will findhimself
rightat home.
The voteofthankswas passed by acclamation.
Chris Chatfield(University
ofBath)
This paperis about 'heresies'.The OxfordEnglishDictionarydefinesa heresyas: 'an opinioncontrary
to orthodoxor accepted doctrine'.But what is accepted doctrinein statistics?It is surelynot the
Discussion on thePaper by Lindsey
29
viewpointof a hardlineBayesianor an ardentfrequentist.
Ratherwe need to look to theviews of the
'typicalapplied statistician'who has to solve real problems.While agreeingthatdifferences
in philosophyexistin thenarrowfieldof inference,
theviews of moststatisticians
are muchmorein harmony
thanmightbe thought
fromtheliterature,
withcommongroundon thetenetsof good statisticalpractice,
suchas clarifying
objectivesclearlyand collectinggood data.
The papersaysmuchaboutwhatBayesiansand frequentists
mightdo, butsuch labels can be divisive
will use whicheverinferential
and I preferthe label 'statistician'.Most applied statisticians
approach
to solve a particularproblem.Perhapsthe real hereticsare statisticians
seems appropriate
who cannot
to applyone mode of inferencein all contexts'.
agreethat'we shouldnotattempt
The biggestfictionof statisticalinferenceis thatthereis a truemodelknowna priori.Surelyit is no
longera heresyto say this?Thereis a growingliterature
(e.g. Chatfield(1995) and Draper(1995)) on
modeluncertainty
topics,such as model mixingand sensitivity
analysis.Withno 'true' model, most
argumentsabout the foundationsof inferencecease to have much meaning and many inferential
'heresies'evaporate.Forexamplemodelcomparisonis clearlymoreimportant
thanmodeltestingifone
It mayhelp to expandinferenceexplicitlyto
acceptsthatone is lookingforan adequateapproximation.
include modelformulation,
but even thenit would only includepartof what a statisticiandoes. The
statistician's
job is to solveproblems,and not(just) to estimateparameters.
How should the paper change what I do? I would have liked more examples as the example on
acquiredimmunedeficiencysyndromewas not helpfulto me. In Fig. 1, I thinkthatI can make better
forecasts'by eye' thanthosegivenby anyofthemodels.
Overallthepapermakesmanygood pointsbutis less controversial
thanitthinksit is and I do nottell
mystudentsor clientsany of thepoints(a)-(e) listedin Section 15. I do agree thatstatisticsteaching,
and thestatistical
shouldmoretruthfully
reflectwhatstatisticians
literature,
actuallydo.
MurrayAitkin(University
ofNewcastle)
I am so muchin agreement
withJim'smainpointsthatI shallsimplylista fewpointsof disagreement.
The likelihoodprinciple
I findthe likelihoodprinciplepersuasivein its narrowform.The different
confidencecoveragesof the
referto referencesets of samples
likelihoodintervalforthepositiveand negativebinomialexperiments
The likelihoodintervalrefersto the experifromdifferent
hypothetical
replicationsof the experiment.
mentactuallyperformed.
Real replications,as in Jim'spredictions,
requirethe explicitmodel forthe
new data.Thereis no conflictwiththelikelihoodprinciplein this.
Models
The 'truemodel' is an oxymoron.Model specificationsare generallydeficientbecause theyomitthe
neglectedpart.The usual normalregressionmodel
iN(0, ga2),
Y, = a +3xi +ci,
failsto distinguish
betweennormalpopulationvariationabout the model mean f,u forfixedx and the
ofthemodelmeanfromtheactualpopulationmean.The representation
thatwe need is
departure
Yilxi N(Ji, a 2),
a + /xi + i,
0
thisoverdispersed
modelcan be fitted
where/i is theneglectedpart.Giving0 an arbitrary
distribution,
maximumlikelihoodas a finitemixture,
as in Aitkin(1996).
bynonparametric
-i
Model comparisons
I have suggestedin Aitkin
How shouldmodel comparisonsbe made, in a pure likelihoodframework?
of the likelihoodratio.In the
(1997) a generalizationof Dempster's(1974, 1997) posteriordistribution
simplestsingle-parameter
case, Ho specifies0 00,and HI is general.The likelihoodratio
LR
L(0o)/L(0)
has a posteriordistribution
derivedfromthatfor0. We can therefore
computetheposteriorprobability
thatLR < 0.1 or anyothervalue whichwouldsuggeststrongevidenceagainstHo. Stone(1997) givesan
exampleof Bernoullitrialswithn = 527135 and r = 106298, withHo: p = 0.2. The observedproportion p = 0.20165 is 3 standarderrorsfromHo, withP-value 0.0027. For a uniformprioron p, the
Bayes factorforHo to HI is 8 and thefractional
Bayes factoris 4.88, whereastheposteriorBayes factor
is 0.0156, in agreementwiththe P-value. The posteriordensityof p forthe same uniformprioris
N(0.20165, 0.000552), so Ho is faroutsidethe 99% highestposteriordensityintervalforp, contra-
30
Discussion on thePaper by Lindsey
theposteriorprobability
dictingtheBayes factorconclusion.Equivalently,
thatLR < 0.1 is 0.964. (The
posteriorprobability
thatLR < 1 is 1 - P = 0.9973.)
This approachis focusedon likelihoodratiosas measuresof strength
of evidence,and it expresses
conclusionsin posteriorprobability
termswhichrequireno morethanconventionaldiffuseor reference
priors.At the same time P-values can be expressedexplicitlyas posteriorprobabilities,providinga
of Bayes,likelihoodand repeatedsamplinginferential
unification
statements
fornestedmodelcomparisons.
D. R. Cox (University
of Oxford)
I was a littlebaffledby thispaper.Thereare certainlyimportant
and constructive
pointsto agree with
and warningsto be heededbutI foundit ratherdisconcerting
thatProfessorLindseyattributes
views to
In particular,
of conventional
othersso firmly.
his descriptions
statisticalthinking
are notrecognizableto
me,as a veryconventional
statistician.
For example,as a keen BayesianI was surprisedto read thatpriorsmustbe chosenbeforethe data
have been seen. Nothingin the formalism
demandsthis.Priordoes notreferto time,butto a situation,
hypothetical
whenwe have data,wherewe assess whatour evidencewould have been ifwe had had no
data.This assessmentmayrationally
be affectedbyhavingseen thedata,althoughthereare considerable
dangersin this,rathersimilarto thosein frequentist
theory.
As an enthusiasticfrequentist,
again some statements
seem to me not partof a sensiblefrequential
formulation.
On a moretechnicalpoint,it is nottruethata frequentist
approachis unableto deal with
non-nestedhypotheses;thereis an extensiveliterature
on thisovernearly40 years.Althoughthereare
difficulties
withthis,thereis no problemin principle.
ProfessorLindsey'slikingfora directuse of the likelihoodis veryunderstandable,
but the difficulties of nuisanceparametersare not reallyaddressed.We need some formof calibrationto avoid the
of theNeyman-Scotttypeas well as misbehaviourof the profilelikelihoodin othermuch
difficulties
simplerproblems.
Some of thedifficulties
whichProfessorLindsey,and manyothers,have withfrequentist
theoryas it
is sometimespresentedare based on a failureto distinguishbetweenthe operationalhypothetical
proceduresthatdefinetermslike significancelevel and confidenceinterval,and advice on how to use
them.The narrower
thegap betweenthesetwothebetterbutthereis a clear difference.
Thus,defining
a
significancelevel via a Neyman-Pearsonapproachrequiresa decision-likeattitudewhich Neyman
himselfdid notuse whenhe analyseddata.
Two finalcommentsare thatall carefulaccountsof asymptotic
theorystressthatthelimitingnotions
involvedare mathematical
devicesto generateapproximations.
These deviceshave no physicalmeaning
and discussionsthatsupposethatthereis sucha meaningare largelyirrelevant.
thefirstsentenceof thepaperis in conflictwiththeempirical
Finally,and perhapsmostimportantly,
observationthatstatisticalmethodsare now used in manyfieldson a scale vastlygreaterthaneven 20
yearsago. Not all theseapplicationsare good and surelythereare manyareas wherestatisticalmethods
are underusedor badly used and we should tryto addressthese. Althoughthereare no groundsfor
thegeneralnoteofpessimismthatstartsthepaperseemsto me misplaced.
complacency,
J. A. Nelder (ImperialCollege ofScience,Technology
and Medicine,London)
I welcome thispaper withveryfew reservations,
the title'The likelihood
but I would have preferred
school strikesback'. Many statisticians
believe thatinferencemustbe eitherfrequentist
or Bayesian,
whichis nonsense.I hope thatthesestatisticians
will come to realize thatenormoussimplifications
can
followfromLindsey'sapproach.For example,theadoptionof likelihood(support)intervalsfortheodds
ratioin 2 x 2 tableswouldlead to ourdiscardinga large,messyand confusingliterature.
I wish thatthe authorhad calibratedthe log-likelihood,
ratherthanthe likelihood,function.I findit
easier to deal withlog(a) =-1, ratherthana = 1/e. Furthermore,
plots withlog-likelihood(or deviance) as theordinateare muchmoreinformative
visuallythanthoseusinglikelihood,wheretheway in
whichtheordinategoes to zero is hardto see.
A topic missingfromthis paper is quasi-likelihood(QL). It is oftenintroducedby stressingthat,
because it is specifiedby the mean and variancefunctiononly,those are the only assumptionsbeing
ifwe normalizeit we findthat
made: notso. Froma QL we can construct
an unnormalizeddistribution;
the normalizingconstantis eitherexactlyconstantor varies only slowlywiththe mean. Further,
the
at least up to level 4, have a patternclose to thatwhichan
cumulantsof the normalizeddistribution,
This is theassumptionthatwe
exponentialfamilywouldhave ifone existedwiththatvariancestructure.
Discussion on thePaper by Lindsey
31
makewhenusingQL. In theformof extendedquasi-likelihood(EQL) (Nelderand Pregibon,1987) it is
identicalwiththe unnormalizedformof Efron'sdouble-exponential
families(Efron,1986). Any EQL
criterionand similarstatistics.In summary,QL is a
gives a deviance and hence Akaike information
valuableextensionto likelihoodinference.
I am interested
in the speaker'sreactionto the idea of the h-likelihood(Lee and Nelder,1996) as a
to modelswithmultiplerandom
way of generalizingtheFisherlikelihood,and so likelihoodinference,
thatis necessaryfor
effects.Maximizationof the h-likelihoodavoids the multidimensional
integration
theuse of marginallikelihood,and thearbitrary
specification
of priorsforthebottomlevel parameters.
Markov chain Monte Carlo samplingis replaced by a generalizationof iterativeleast squares,with
computingtime being reduced by ordersof magnitude.I hope thatthe speaker will contributeto
to thiswiderclass ofmodels.
extendinglikelihoodinference
S. M. Rizvi (University
of Westminster,
London)
I am delightedby thishighlyprovocativeand controversial
paper,as it conformsto some of my own
hereticalviews developed over the past 30 years of practisingand teachingstatistics.I had long
suspectedthatSirRolandFisherwas a clandestineBayesian,as some timeago I derivedhis multivariate
function
fromtheBayesianstandpoint.
discriminant
I also believethattheauthor'ssecondprinciple(p. 2) is self-contradictory.
It is necessaryto appreciate
thatepistemology
deals withquestionsof perception,questionsof inferenceand questionsabouttruth.
Thus modelswhichbest conformwithdata best accordingto one's light are a simplification
of the
in that.Further,
theauthor'sinsistence
reality,
and hencethatof ontologicaltruth.I see no contradiction
in whichmodels are cast is not supported
thatinferenceshouldbe invariantto the parameterizations
by experience,as the authorhimselflaterindicates.Once you accept a model,yourinferencetakes a
deductiveform.Anyinferenceis subservient
to theparameterization,
and notinvariant
to it. I rejecthis
secondprinciple.
Thereis no suchthingas a truemodel or a truetheory.Thereare onlymodelsand theoriesthatwork
or do notwork.Once we accepta modelbased on givendata,thereis no escape fromtheconclusions,at
leastwithrespectto thesampleconcerned.We need to acceptthatmodelsare tentative
and provisional.
This is whollyconsistentwith scientificinference.That judgmentsof science are provisionalis not
appreciatedevenby someNobel prize-winners.
I questiontheintroduction
of an 'intuitiveor pseudo-intuitive'
as theauthorappearsto do
argument,
whenhe castigatestrue'models' as counter-intuitive.
We need to appreciatethatintuitionis an aid to
butit is nota groundfora discursiveargument.
Russellwas to putit moreeloquentlywhen
perception,
with'animalinference'.
he equatedintuition
P-values are fine,but theydo not provideus witha clear dividingline foracceptingor rejectinga
A significance
testdoes.
hypothesis.
I take issue withthe author'sinsistencethatinferenceshouldnot be backed by probabilities.That
assertionsshouldbe based on probabilities
has respectableadvocates.It was thephilosopherJohnLocke
who was the firstto assertthatthe degreeof assentwe give to a propositionshouldbe based on its
probability.
Finally,confidenceintervalsare fineif you are dealingwitha two-sidedalternative
hypothesis;they
witha one-sidedalternative.
are misleadingwhenconfronted
AndrewEhrenberg(SouthBank University,
London)
in the
I am sure thatwe all laud ProfessorLindsey's effortto face up to the many contradictions
theoretical
of statisticalinference.
treatment
These contradictions
are,as he says,one reasonwhystatisticsis so oftenviewedas bothdifficult
and
unnecessary.
But even if statisticians
raisesit would
do face up to and resolvetheproblemswhichLindseyrightly
notimprove'the low publicesteemforstatistics'.Statisticswould stillbe seen as an 'unnecessaryevil',
because studentsand clientshardlyever actuallyuse the tools of statisticalinferencewhichtheyare
Nor is thereanysoundreasonwhytheyshoulduse them.
beingoffered.
As I see it,statisticalor probabilisticinferenceis at best a relativelyminortechnicalissue. It arises
whenone has just a singlerathersmall sample,whichhas been properlyselectedfroma well-defined
population.Thatis rare.
and clientsusuallyface
Instead,appliedstatisticians
(a) eitherquitea largedata set,or evena verylargeone,
32
Discussion on thePaper by Lindsey
butbroadlycomparablepopulations,and hencesampleswhichare at
(b) or data frommanydifferent
leastcumulatively
large,or verylarge,
(c) or a small data set whichis not an explicitsample fromany definedlargerpopulationanyway
(e.g. all theacquiredimmunedeficiencysyndrome(AIDS) cases in St Thomas'sHospitalin the
autumnof 1992).
But theseare not
Such datahaveproblemsof analysis,interpretation
and,broadlyspeaking,inference.
statisticsin generalor in ProfessorLindsey'spaperin particular.
addressedeitherbytheoretical
To finishwitha specificpoint:ProfessorLindseysays thatin his experiencemorethan95% of the
inferential
workwhich is taken to statisticiansfor advice is exploratory,
i.e. concernedwithmodel
development.
But whatappliedstatisticians
and clientsmostlydo on theirown is to use existingmodels(formalor
informal),
ratherthanalwaysto developnew ones. For example,anyanalysisof AIDS data shouldnow
use thepriorknowledge(or 'verbalmodel') thatthereare delaysin thereporting
of cases of AIDS, and
also in therecordingofthem.
fromfirstdevelopingit.
Usinga modelis verydifferent
G. A. Barnard (University
ofEssex, Colchester)
Lindsey'spaper is verymuchto be welcomedforits emphasison the factthatstatistics,like mathematicalphysicsand fluidmechanics,is a branchof applied mathematics
in whichresultsmaybe very
valuabledespitebeingonlyapproximate.
And his stresson therelevanceof new computingpossibilities
to thelogic of ourfieldis mosttimely.
forinference
My onlyseriousqueryarises fromhis remark(p. 2) thatFisherwas nevera frequentist
purposes.True, he made clear in his 1925 classic thatinferenceswere to be expressedin termsof
likelihood,notin termsof probability.
And he was muchcloserto Jeffreys
thanto theNeyman-Pearson
school. But likelihoodratioshave a long runfrequencyinterpretation,
in termsof odds, thoughnot in
termsofprobabilities.
If E has been observed,thelikelihoodratioforH versusH' is
L(E) = Pr(E IH) /Pr(EIH').
If we choose H onlywhen L(E) exceeds W,and H' onlywhen 1/L(E) exceeds W,makingno choice
thenwhenevereitherH or H' is truethelong runfrequency
ratioof rightchoices to wrong
otherwise,
choiceswill exceed W
Limitationson computerpowermeantthatFishercould not implement
thisview in his lifetime.But
now deskcomputers
withspreadsheetfacilitiescan exhibitvariouslikelihoodfunctions
fullyconditioned
on thegivendata on variousassumedmodels.We onlyneed to worryabout such differences
whenthe
spreadsheetsdifferseriouslyin relationto the questionsof interest.Such cases are much rarerthan
unconditionalapproachesmightlead us to think.The Cushny-Peeblesdata quotedby Studentin 1908
and by Fisherin 1925 and moreaccuratelyby Senn (1993) illustrate
thispoint.Cushnyand Peebleswere
interestedin the evidence for or against chiralitywith the drugs used. Withoutchiralitythe data
about0. Withchirality
it wouldbe symmetrical
aboutsome non-zero
distribution
wouldbe symmetrical
value. Assumingexact normality
and no chiralitythe odds on t fallingshortof ratherthanexceeding
4.03 are exactly698 to 1. But if we replace the false assumptionof normalitywiththe equally false
assumptionof a Cauchydensitywe findthatfor thedata to hand the Cauchy spreadsheetdifferslittle
fromthe normalspreadsheet.And littlechange resultsfromtakingpropernote of the factthatthe
patientswould nothave been allowedto sleep formorethan24 hours.To relateour spreadsheetsto the
butwe can see from
questionat issue we mayneed to makeassumptionsabouttheirrelevant
parameters,
thespreadsheethow much,or how little,theseassumptions
matter.
StephenSenn (University
College London)
This stimulating
and Bayesian
papermakesmanyexcellentpointsnotleast of whichis thatfrequentist
modelsbutthatthisrequirement
is commonlyignored.However,I
approachesbothrequireprespecified
am notconvincedthattheauthor'simplicitdistinction
betweenparameterestimationand model choice
is fundamental.
For example,he objectsto 'incorporating
otherthan
priorknowledgeand information,
thatrequiredin theconstruction
ofthemodel',butin practiceI thinkthattherewill be manycases where
will be unsurewhatthisLindseyianprescription
the applied statistician
permitsand forbids.Take the
AB/BA crossoverdesign (Grieve and Senn, 1998). We can considermodels withand withoutcarryover.For thischoice it seems thatthe authorwould allow priorknowledgeto play some role,but the
lattermodelis merelya special case of theformer
withA,thecarry-over
effect,equal to 0. However,the
Discussion on thePaper by Lindsey
33
modelwithcarry-over
effectsincludedis uselessunlesscoupledwithan informative
priorthatA is likely
to be smallwhich,however,it seemshis prescription
forbidsus to assume.
Also the problem(Section 8) of what is called consonance in the technicalliterature
on multiple
testing(Bauer, 1991) is notuniqueto statisticsand thereis nottheslightestdifficulty
in explainingit to
non-statisticians.
Two squabblingchildrenare separatedby theirmother.She knowsthatat least one of
themis guiltyof aggressivebehaviourbutshe will oftenbe incapableof establishingdefinitely
thatany
givenchildis. Is thisincompatible
withhumanlife?
Finally,a small matter:contraryto Section 15, it is the exceptionratherthanthe normfordrug
let alone individualclinicaltrials,to involvetensof thousandsof subjects.A
development
programmes,
typicalprogramme
will involve5000 or fewersubjectsand individualtrialsare usuallymeasuredin the
hundreds.The author'spointaboutlostopportunities
in measurement
is, none-the-less,
well taken.
D. V. Lindley(Minehead)
My commentsare mostlyon statements
made in thepaperaboutBayesianideas.
(a) 'Inferencestatements
mustbe made in termsof probability'because of arguments
by de Finetti
and others.These take self-evident
principlesand fromthemprovethatuncertainty
satisfiesthe
probabilitycalculus. It is wrongto say thatthe Bayesian methodis 'internallycontradictory'
whenitis based on coherence theavoidanceof contradiction.
is nottaken'formathematical
convenience'.Thereare enoughexamplesto
(b) Countableadditivity
showthatsome consequencesof finiteadditivity
are unacceptable.In anycase, conglomerability
is self-evident
formost.
(c) Likelihood is not, alone, adequate for inferencebecause it violates one of the self-evident
principlesin (a). It can happenthat,withsets of exclusiveevents{Ai}, {Bi}, Ai is morelikely
thanBi forall i, yettheunionoftheAi is less likelythanthatof theBi.
(d) Likelihoodis not additive,so thereis no generalway of eliminatingnuisanceparameters.13
varietiesof likelihoodtestify
to thestruggleto overcomethishandicap.
The joint distri(e) The old examplein Section 8 is not incompatiblein the Bayesian framework.
butionof a, /3has marginsfora and /3separately.Is thesuggestionthatthelatterare incompatible withtheformer?
(f) Probabilityis invariant.What may not be invariantare summarystatistics,like the mean or
notof thedistribution.
highestposteriordensityintervals.But thisis a criticismof thesummary,
Similarcommentsapplyto thecomparisonofthenegativeand positivebinomials.
is alwaysconditionalon whatis known,or assumed,whentheuncertainty
statement
(g) Probability
is made.It is nottherefore
truethat'all possibleeventsmustbe defined'.
(h) All of us are continuallymakinginformalpriorjudgmentsafterseeing the data. Justbefore
writingthis,I readof some data on lobbyists.A reactionwas to thinkof myprior(to thedata) on
and to recallthe'cash-for-questions'
affairin theHouse of Commons.
lobbyists,
it is not
Like probability,
(i) A model is yourdescriptionof a small worldin termsof probability.
betweenyou and thatworld.
objectivelytrue,butexpressesa relationship
(j) The statement
abouttheparadoxis wrong.The modelwithpointpriordoes notdominateif the
alternative
is true.
said thata modelshouldbe as big as an elephant.Untilrecently
(k) Manyyearsago Savage correctly
we have not been able to act on the advice forcomputational
reasons.It now looks as though
but therestill remainsthe
Markovchain Monte Carlo samplingwill overcomethe difficulty,
distributions.
unsolvedproblemoftheassessmentand understanding
of multivariate
Nick Longford(De Montfort
University,
Leicester)
JimLindseyshouldbe congratulated
on a thoughtful
critiqueof manyconventionsdeeplyingrainedin
statisticalpractice.Many of themdo have a rationalbasis, butonlyin a specificcontext.Our failureis
thatwe applythemuniversally.
I have specificcommentson Section11.
I do not see the conflictbetween'the largerthe sample the better'and the needs of an applied
All otherdesignfactorsbeing equal, largersamplesizes are associatedwithmoreinformastatistician.
tion.The purposeof the sample size calculationsis to save resourcesand to ensureprecisionthatis at
least as highas a presetstandard.When a model of certaincomplexity(smoothness)is requiredit is a
faultof themodel selectionroutinethatlargerdata sets tendto yieldmorecomplexmodels.No model
selectionprocedureis verygood if it functions
well onlyfor'good' sample sizes. A moreconstructive
approachthanthatimpliedby Lindsey'sdiscussionis to remove,or smoothout,fromthemodel selected
34
Discussion on thePaper by Lindsey
by a formalstatisticalprocedureany effects(differences)
thatare substantively
unimportant.
Then we
can address,ifneedbe, theissue ofprecisiongreaterthantheminimumacceptable.
Vaguelyspeaking,by the sample size calculationswe aim at matchingstatisticalsignificancewith
substantive
importance.The product,the sample size n, maybe way offthemarkbecause the formula
used involvesresidualvarianceand otherparameters,
some of whichare onlyguessed,oftenwithonly
modestprecision.Also, in multipurpose
studies,or studiesthatwill by analysedby a wide constituency
of secondaryusers,thesamplesizes will be 'too large' forsome inferences
and too smallforothers.It is
withthe latter,but no problemwiththe former.If everything
unfortunate
else fails,simplyanalyse a
simplerandomsubsampleofthecases.
T. A. Louis (University
ofMinnesota,Minneapolis)
Lindseymakes manyexcellentand manyprovocativepoints.Most of my commentsderivefromthe
view thatthelikelihoodis keyto developinginferences
and thatno model is correct.ThoughI agreeon
thelikelihood'scentralrole,effective
designsand analysesalso dependon inferential
goals and criteria
(see Shenand Louis (1998) on thepoorperformance
ofhistograms
and ranksbased on posteriormeans).
Lindsey supportsgoal- and criteria-driven
proceduresin his considerationof Bayesian highest
content
posteriordensity(HPD) regions.An HPD regionhas thesmallestvolumefora givenprobability
butis nottransformation
equivariant.Posteriortail area intervalsfora scalar parameterare equivariant,
and Lindseypromotestheiruse. However,transform
equivarianceis onlyone criterionand I shall use
for a skewed posterior,for a multimodal
the HPD region when minimizingvolume is important,
posterioror one withouta regularmode and fora multivariate
parameter.
Lindseyis correctthatno model is correct.Unfortunately,
he concludesthatmodel complexitymust
increaseas the sample size increasesand so asymptoticanalysesbased on an assumedtruemodel are
irrelevant.
Believingthatsome parametersretaintheirmeaningwithincreasingsample size, I strongly
A largesample size allows us to
disagreethatasymptotic
propertiessuch as consistencyare irrelevant.
increasecomplexity
butshouldnot requireus to. Also, asymptotics
can revealimportant
propertiesand
justifymodemmethodsoffinitesampleinference
suchas thebootstrap.
In choosingbetweenmodels, aP likelihoodcalibrationeffectively
structures
the trade-off
between
complexityand degreesof freedom.Selectinga is vitaland moreguidanceis needed (whatis thebasis
in choosing
fora = 0.22 in Section 8.1?). However,goals and priorknowledgecan be instrumental
betweencandidatemodels (see Fosterand George (1994)). And, selectionproceduressuch as crossvalidationhave the advantagesof dealingdirectlywithpredictivegoals and of being relativelyfreeof
assumptions.
continuousprobability
models.Continuousdistributions
Lindseygoes overboardin criticizing
provide
to (discrete)realityand have theirquirks,but theycan be effectivein studying
onlyan approximation
propertiesand communicating
insights.Furthermore,
replacingtheGaussianby thePoissondistribution
will generateas manyproblemsas iteliminates.
Lindsey criticizesthe use of standarderrors(SEs). His criticismis old but worthrepeating.SEs
summariesof thelikelihood(or
shouldbe replacedby a set of likelihoodregionsand otherinformative
or posteriordistribution).
samplingdistribution
statistical
Finally,are the'heresies'in thetitlecurrent
practicesor theauthor'sviews?
Peter McCullagh (University
ofChicago)
In statistics,
as in fashion,a model is an idealizationof reality.JoePublic knowsthisall too well, and,
feelobligedto engagepublicallyin incoherent
he does notordinarily
althoughhe maygrumblesilently,
does not live up to the image portrayedin glamour
Quixotic ramblingswhen his wife or girlfriend
magazinesand romanticnovels. Scientistsknowthisalso. To a large degree,science is the searchfor
Mendel'slaws are good science,butit does not
a searchformodelsas an aid forunderstanding.
pattern,
followthattheyare universallyobeyedin nature.Justas Mendel's laws do notcoverall of genetics,so
too neithertheBayesiannorthe frequentist
approachcoverseveryfacetof statisticalpractice.I do not
see this as a bad thingforgeneticsor statistics;it is evidence of the richnessand diversityof both
subjects.
forvalid criticism.But sweepingunsubstanCurrentstatisticalpracticepresentsample opportunity
tiatedclaims,undueemphasison thepicayuneand plain errorsof factdo not add up to a strongcase.
in theBayesianapproach.
I see no internalcontradictions
Despitetheauthor'ssinceresermonizing
issues is ordinarily
welcome.Whilethispaper
The opportunity
to discussand to reassessfoundational
is passionatelyprovocativeon minorpointssuch as likelihood(3), the discussionof major issues is
Discussion on thePaper by Lindsey
35
irritating
in tone and confusedon the facts.Despite this,the data analysisseems reasonablysensible,a
triumphof commonsense overunsounddogma.In thediscussionof foundational
matters,
however,the
partsof thepaperthatare trueare notnew,and partsthatare neware nottrue.
R. Allan Reese (Hull University)
Althoughagreeingthatthelack of sharedcredoand approachcontributes
to thepoor publicperception
of statisticsand statisticians,
it is extremeto place theblameentirely
on statisticians
and teachers.Users
of statisticalmethodsthemselvesapproachthesubjectwithfearand reluctance,treating
it as somewhere
betweenabstrusemathematics
and witchcraft.
An inabilityto relatequantitative
data to realitymaystart
well beforestudentsmeetstatistics(Greer,1997). Fromthatstarting
point,and invariablyagainsttight
to instilgood principlesof dataanalysis.
deadlines,itis difficult
Thereare shiningexamplesof good practiceand excellentsoftware,
and I have taughtusingGenstat,
GLIM, SPSS and Stata.However,I have knowna studentattenda whole courseon GLIM onlyto ask
'buthow do I do regressionlike in thebook?'. Mead's (1988) classic workcontainsan excellentchapter
on 'Designingusefulexperiments',
butthisis tuckedaway at theback because 'the standardformatof
books on designis inevitable,and therefore
correct'.I wishthathe had challengedthatview!
Underlying
thepaper appearsto be a philosophicalquestion.We make assumptionsand hence propose mathematicalmodels whichreflecta theorythatis externalto the data. We can testthe models
and findthosewhichfitbest.However,we can nevertestor verifythe assumptions.Each real-lifedata
set is unique.Even whenrepeatinga well-established
protocolundercontrolledconditions,we need to
allow forthepossibilityof outliers.Hence theargument
is forevercircular we trustthemodelbecause
it fitsthe data, and we believe the data because theyfollowthe model. In thatsense, 'all' statisticsis
heuristicand all relieson inference.
This dilemmais illustratedin Fig. 1(a). A Health Ministerseeing thatgraphwould ask, 'Is the
incidenceof acquired immunedeficiencysyndromenow going up or down?'. Because of delays in
reporting,
we can 'only' answerthatquestionby relyingon a model. The model mustincludeassumptionsaboutthepatternof reporting.
Choose yourassumptions,
and you can give whicheveransweryou
thinktheMinisterwouldprefer:lies, damnedlies,and .. .?
The followingcontributions
werereceivedin writingafterthemeeting.
P. M. E. Altham(University
ofCambridge)
ProfessorLindseyhas givenus a discussionpaperwhichis scholarly,
and thoughtprovoking.
thoughtful
I havejust a smallpointto make.
The example about acquired immunedeficiencysyndromecorrespondsto a large and sparse (trithedevianceof 716.5, with413 degreesof freedom(DF)
table.The sum forming
angular)contingency
quoted on p. 8, and the subsequentdeviancesforthe same data set includea highproportionof cells
withfittedvalues around0 or 1. This must surelyhave a severe effecton the reliabilityof the x2A traditional
approximation.
remedyis to combinesome of the cells. For example,as a quick solution
we combinethelasteightcolumnsofthedata setof De Angelisand Gilks(1994). Thenthedeviancefor
model (1) is now 514.54 with254 DE Using thenegativebinomialfitting
routinesuppliedin Venables
and Ripley's(1997) library,
I foundthatthecorresponding
negativebinomialmodelhas deviance337.79
with254 DE Consideringthatthetotalsamplesize is 6160, thisdoes notseem such a severedeparture
frommodel(1) and it allows some degreeof overdispersion
relativeto thePoissonmodel.
F. P. A. Coolen (University
ofDurham)
I shall mentiontwo approachesto statisticalinferenceand referto commentsby Lindsey.I strongly
supportthe subjectiveapproachto inferencewheneverpracticalsituationsrequirethe use of knowthatis available in experimental
or historicaldata. de Finetti(1974) used
ledge otherthaninformation
prevision(or 'expectation')insteadof probabilityas the main tool forinference(Lindsey,Section3),
and indeed previsionfor observablerandomquantitiesprovidesa naturalconcept for dealing with
This line of workhas been extendedsuccessfully
uncertainty.
by Goldstein;see theoverviewand further
referencesin Goldstein(1998). One of the nice foundationalaspects of this work is that a Bayes
but
posterioris the expectationof whatone's beliefswill be afterobtainingthe additionalinformation,
notnecessarilyequal to thosebeliefs(Goldstein,1997). Of course,scientific
discovery(Lindsey,Section
to a subjectivist.The factthatactual beliefsmightbe different
froma Bayes posterior
3) is important
is perfectlyacceptable.The subjectiveapproachallows the use of non-observablerandomquantities,
or parameters,but well interpretable
quantitiesare needed forassessment.Problemswith regardto
36
Discussion on thePaper by Lindsey
inferenceon uninterpretable
parameters(Lindsey,Section 8) are avoided if the actual problemsare
expressedin termsofwell interpretable
randomquantities;
Notwithstanding
my supportforthe subjectiveapproachto inference,I feel the need for a nonsubjectivemethodforinduction.I was delightedto see Lindsey'sreference
to Hill (1988) butsad thatno
further
was paid to Hill's assumptionA(,,)and relatedinference.
In manysituations(butnotfor
attention
examplein the case of timeseries),bothBayesiansand frequentists
are happywitha finiteexchangeabilityassumptionon futurereal-valuedobservations,
meaningthatall possibleorderingsare considered
as equallylikely.Fornonparametric
methodsthisassumptionof equallylikelyorderings
shouldstillhold
aftersome data have been observed.This post-dataassumptionis Hill's A(,,),whichis not sufficient
to
in manysituationsbutdoes providebounds(Coolen, 1998) which
derivepreciseprobabilistic
inferences
are impreciseprobabilities(Walley,1991). I suggestthatA(,) providesan attractive,
purelypredictive,
thatcan be used as a 'base-linemethod'(Lindsey,Section13). Moreover,
methodforstatistical
inference
morerealworlddata than
A(,) can deal withdata describedby Zipf's law (Hill, 1988), whichrepresents
any otherknownlaw, includingthe Gaussian law, and I doubtwhetherthisis trueforPoisson process
models(Lindsey,Section6). The mainchallengeforA(,,)-basedstatisticalinferenceis generalization
to
multivariate
data (Hill, 1993).
JamesS. Hodges (University
ofMinnesota,Minneapolis)
ProfessorLindsey'spaperis mostwelcomeand greatfun.Hear,hear: 'Foundationsof statistics'is dead
to practice;theneeds of applied statisticians
because it is irrelevant
shoulddrivethenew foundational
on manyspecifics,e.g.
debate.(Hodges (1996) takesa similartack.)Lindsey'spaperwas also delightful
Bayes factors.
None-the-less,two of ProfessorLindsey'skey premisesare unsatisfying.
The firstis that 'model
development[is] the principalactivityof most applied statisticians'.In 15 years of highlyvaried
statisticalpractice,myprincipalactivityhas been helpingcolleagues to answersubstantivequestions.
Contra Lindsey ('Scientistsusually come ... with a vague model formulation
to be filled out by
empiricaldata collection,not witha specifichypothesisto test'), my colleagues almostalwayshave a
specificpurpose; thoughit is not always formulatedas testingsome , = 0, ProfessorLindsey's
descriptionis stillinaccurate.Moreover,selectinga model rarelymatterseven to me. Instead,themain
issue is whetherthe data give the same answerto the substantivequestion in alternativeanalyses,
thatis internalto each analysis.(Oddly,Fig. 1 shows no uncertainty
accountingforthe uncertainty
bounds.) A minorityof such alternativeanalysesare motivatedby statisticalobsessionslike outliers.
of variablesand other
More often,theyinvolveexcludinglarge subsetsof the data,variantdefinitions
in modelform.
in likelihoodfunctions
and notworthenshrining
problemsthatare awkwardto represent
The answerto thesubstantive
questionusuallydoes notchangeacrossthealternative
analysesbut,when
it does, in myexperienceno meremodelselectioncriterion
will convincea scepticalcolleague or editor
on a particularanalysis.
is compoundedby ProfessorLindsey'spreferred
This difficulty
criterion,
forscepticsmustbuy the
is
criterion
magicnumbera. But wheredoes a come from?SometimestheAkaikeor Bayes information
acceptable,butotherwisewe mustpick a. In Section8.1, a = 0.22-not 0.21, not0.23 is presentedas
'reasonable'. Why should we preferProfessorLindsey's view of reasonableto anyone else's? The
rationalein Section 11.1 is better:we need a = I0-'3 to choose the parametricmodel overthe others.
This movesin a potentially
usefuldirection-whatdoes it taketo changetheanswer? butis stillquite
removedfromthequestionsforwhichsomeoneobtainsmoneyto collectdatato bringto a statistician.
D. A. Sprott(University
of Waterloo)
This is an interesting,
and well-written
thoughtful
paper.I have onlya fewminorpointsof disagreement
or emphasis.
It wouldbe nice to believethatthecriticismsof theemphasison unbiasedestimatesare well known.
Butjournalsstillseem intenton publishingpaperscorrecting
themaximumlikelihoodestimateforbias
and estimatingthe varianceof the resultingestimateseven thoughthe resultingestimatingintervals
containhighlyimplausible,ifnotimpossible,values oftheparameter.
inference.
But mymain commentis aboutthepositionrelegatedto confirmatory
or non-exploratory
is exploratory.
But I thinkthattheadvanceof sciencealso rests
Perhaps95% of theworkof statisticians
on what happens afterthe exploratorystage the process of demonstrating
repeatabilityand of the
subsequentaccumulationof evidence.This is given shortshriftby the commentthatwhen this stage
is reachedsophisticatedstatisticsis no longerneeded,the factsspeakingforthemselves.This is un-
Discussion on thePaper by Lindsey
37
doubtedlythe finalresult,as discussedby Fisher(1953a) and by Barnard(1972). But much,possibly
statistics
sophisticated,
precedesit.Fisher(1953b) was himselfan exampleof this.Perhapsthescientists
who originatetheproblems,notstatisticians,
will be thosewho are involvedin thisprocess,as suggested
by MacGregor (1997). But this would merelyconfirmthe obvious low esteem advertedto in the
aboutthislow esteemseemsless
introductory
sentenceofthepaper,althoughtheconcernof statisticians
obvious.
The authorrepliedlater,in writing,
as follows.
I am pleased to see thehighlevel of constructive
criticismby mostdiscussants.This topiccan engender
a
strongemotions!Many questionthatI say anythingheretical.Yet, afterone earlierpresentation,
theoreticalstatisticianasked me 'Do you actuallyteach yourstudentsthis stuff?'.One referee'sconclusion was 'This is a shallow,incomprehensible,
unscholarlyofferingthatoughtnot to grace our
is ofteneverydaypracticefortheappliedstatistician.
Society'sjournals'.Heresyforthetheoretician
In exploratory
scientific
as in muchof appliedstatistics,
thegoal is to convinceyourpublic,
inference,
herethe scientificcommunity,
thatthe empiricaldata supportyourscientifictheorybetterthanothers
if a scientistapplies independent
replicationto the statistical
available.Replicabilityis key In contrast,
he or she risksreceivingquitedifferent
answers!
analysisprocess,consultingseveralstatisticians,
is thatwe oftenhave one favouritemodel, and inferenceproOne greatweaknessof statisticians
to
cedure,thatwe tendto applyeverywhere
possible. In contrast,the likelihoodapproachis restricted
scientific
modelselection.It has nothingto offerin courtdecisions,industrial
qualitycontrol
exploratory
or phase III clinicaltrialdrugacceptance.
in mycaricaturesof Bayesians
Severaldiscussantsobjectthattheydo notrecognizereal statisticians
but these descriptionsreferto foundationprinciples,not to people. Should I have
and frequentists,
strictly
followingBayesian
replaced'Bayesiansor frequentists
do' each timeby 'an applied statistician
or frequentist
momentdoes'? (Most theoreticians
are less flexible.)If theproprinciplesat thisparticular
ceduresof thesetwo schools are meantto describegood practice,whythendo certainargue('Neyman
himselfdid notuse [it]whenhe analyseddata') thattheyare meantto be ignoredin application?Should
we notexpectthe same highstandardsfromour professionas we do of scientistswho consultus? Will
because ofthearbitrariness?
notourpublicotherwisebe unconvinced,
Virtuallyall the discussantsseem agreed thatno model is true.Few, however,address my more
fundamentalargument,thatprobabilityis inappropriatefor exploratoryinference.Comparabilityof
likelihoodcontourlevels withchangingparameterdimensionrequiressome functiong(p) forcalibrait is
tion:L(O; y) xcg(p). Forlikelihoodinference,
L(O; y) xcaP;
(5)
forthefrequentist
approach,usingthedeviance,
L(O; y) oc exp(-Xp/2)
(6)
whereasfortheBayesianapproachit is morecomplex,
L(O; y)
h{Pr(OIy)} f(y)
Pr(O)
(7)
Althoughequations(5) and (6) superficially
appearmostsimilar,theyare not.Ratherequations(6) and
whereasequation(5) is a pointfunction.
(7) bothrequireintegration,
Several discussantsask formore detailsabout the choice of a. As witha confidenceor credibility
level, it specifiesthe level of precisiondesired:how much less probablecan a model make the data,
of
comparedwithyourbestone,beforeyou concludethatit is implausible?Alongwiththespecification
a scientifically
useful effectin planninga study,this thendeterminesthe sample size. Justas 95%
so has a = 1/e.
intervalshavebeen foundreasonablein manypracticalsituations,
I agreevirtuallycompletelywithmanyof the discussants(George Barnard,ChrisChatfield,
Andrew
Ehrenberg,David Hand, JamesHodges, Nick Longford,JohnNelder,Allan Reese, StephenSenn and
David Sprott)and shallonlyreplyto some ofthen,verybriefly.
The dangerof exploratory
inferenceis that,ifenoughdifferent
modelsare consideredfora givendata
modelswill alwaysbe found.Thus, I agree withDavid Hand thatthe questionis
set,some well fitting
therestricted
primary:
groupof modelsto be considered,whetherbeforeor,ifnecessary,afterseeingthe
data, must be sensitiveto the questionbut insensitiveto pointsthatare irrelevantto the question.
38
Discussion on thePaper by Lindsey
However,I disagreethatforcinga model to be simple drivesit away fromthe truth:the art is to use
to highlight,
simplicity
and to convinceyouraudienceof,thereal effects.
I also maintainmysolutionto thefuelconsumption
problem:thequestiondetermines
whichunitsto
use, but a unique model must apply for the two units or the answersto the two questionswill be
fundamentally,
notjust superficially,
contradictory.
As to theproblemwithunitsin myown model (John
Nelder firstpointedthiserrorout to me), I can only plead guilty:the model mustbe hierarchicalcriticismbythescientific
in operation.
community
Frequentist
proceduresare notoriousforbeing ad hoc; David Draperbringsus up to date on recent
in applied confirmatory
patchesto Bayesiantheory.It will make its breakthrough
statisticswhenphase
III clinical trialprotocolsspecifypriordistributions.
The beautyof the likelihoodapproachis thatit
yieldssimpleclear interpretations,
requiringno such arbitrariness:
models arejudged by how probable
theymaketheobserveddata.
To convince your audience, probabilisticinferencestatementsrequireprior (to data collection)
guarantees:a fixedmodel (hypothesis)and sample space forthe frequentist,
a fixedset of modelswith
theirprior probabilitiesfor the Bayesian (Senn, 1997). In this sense, they are deductive,making
scientificdiscoverydifficultor impossible.Likelihood inferenceallows freedomto considermodels
suggestedby the data,withthe same interpretable
measuresof uncertainty,
althoughtheywill be less
convincingthanthosestatedbeforehand,
hence requiringrepeatednesswithinthe scientific
community.
Scientificinferenceis nota (necessarilyindividual)decisionproblem.
David Draperdownplaystheimportanceof coherence.But how can a posteriorBayesiancalculation
have a probabilisticinterpretation
withoutcoherence,exceptin theformalmathematicalsense thatany
setofnon-negative
numberssummingto 1 has?
MurrayAitkinhas not yet convincedme thatposteriorBayes factorscan help, comparedwiththe
of likelihoodratios,in communicating
to studentsand clients.
simpleinterpretation
Nowheredo I claim thatfrequentists
are unable to deal withnon-nestedhypotheses.As David Cox
pointsout,methodsare available,although,as I stated,theyare notwidelyaccepted.Perhapsthereason
is thatthe underlying
models are unreasonablein most contexts.Most standardcourses(and all texts
thatI have been able to consult) do not cover these methods.Studentsare only taughtthe nested
approachand infer(iftheyare nottold!) thatitis theonlyone possible.
The questionof nuisanceparametersarises afterselectinga model,whenwe make inferencesabout
thisis notthetopicof thepaper.Nuisanceparameterscreateno problem
of interest;
specificparameters
forlikelihood-basedexploratory
inference.Profiles,as summariesof a complexlikelihoodsurface,are
in the likelihoodapproach(farmorethanmarginalposteriors).The judgmentthat
alwaysinterpretable
theymisbehaveis asymptotic
(e.g. consistency).Thus asymptotic
theorydoes notjust generateapproximationsbut permeatescriteriaforchoosingprocedures.To be acceptable(publishable),a procedure
must yield consistent,asymptotically
normalestimatesnot on the parameterspace boundary,with
calculable standarderrors,superiorasymptotic
relativeefficiency,
and so on, all irrelevant
to likelihood
inference,
butneed notgivetheprobability
ofthedata (estimating
equations).
I am more pessimisticthan David Cox about our image. Statisticsis more widely used through
compulsion,notthe activedesireof manyactual users;wide use does not implyrespectability.
(If you
want to see the expressionof real disgust,withoutrevealingthatyou are a statisticiantalk to any
laboratoryassistantor junior scientistwho once sufferedthroughintroductory
statistics.)The main
impulsionforthisexpansionappearsto be theneed forgood design,notinference.
I entirely
shouldbe calibrated.I do thatin theexample.
agreewithJohnNelderthatthe log-likelihood
However,in didactic communicationwith studentsand clients,speaking of the probabilityof the
is oftenclearer.
observeddataundervariousmodels,ratherthanitslogarithm,
With modem computingpower, quasi-likelihoodscan easily be normalizednumericallyso this
approachis now amenableto fulllikelihoodinference.I have nothad theoccasion to workwiththe hlikelihood,perhapsbecause I have not yet encounteredconsultingproblemsrequiringit. The large
numberofparameters
mustmakecompleteinferences
(and notjust modelselection)complex.
I agreewithAndrewEhrenbergthatmostexploratory
workinvolvescomparingexistingmodels;I did
notmeanto implythatthe95% referred
to developingnewmodels.
for inferencepurposes,as I definedit, is not the same as using a frequency
Being a frequentist
of probability
in inference,
interpretation
somethingthatFisherobviouslydid. Indeed,I discussFisherian P-values GeorgeBarnardprovidesanotherexample and we shouldnotforgetfiducialprobability.
But I can findno reference
whereFisherrequiredlongrunguaranteesofhis procedures;he ratherargued
thatsignificance
oftheobserveddataundergivenhypotheses.
testsprovidemeasuresofrarity
Discussion on thePaper by Lindsey
39
If StephenSenn'smodelwithcarry-over
is uselesswithoutan informative
prior,thenwhatexactlycan
the data tell us aboutit? Can we calculatetheprobability
of thedata underthismodel and is theresult
If not,and we believethata small carry-over
effect
distinctfromthatforthemodelwithoutcarry-over?
is present,and important,
perhapsthedesignshouldbe changedto obtaintherequiredinformation.
Incompatibility
meansthatthemotherwouldeitherdiscover,individually,
thatneitherof thechildren
or, collectively,thatat least one is guiltyof aggressivebehaviour.The point is thatthe criterionof
whentheyarejudged separatelyand together.
individualaggressiveness
(precision)differs
Dennis Lindley's 'self-evident
Unfortunately,
principles'are not evidentto me. He agrees thatthe
alternative
model mustbe trueto avoid theparadoxof pointpriorprobabilities.This is fundamental
to
of
myrejectionofthisBayesianapproachto exploratory
modelselection.Assessmentand understanding
multivariate
distributions
of thedata are a sufficiently
big problemforme withouttheadded complicationofprobabilities
on theparameters.
I would be interested
to knowhow Nick Longfordproposesto remove,or smoothout,froma model
This seems to be developinga new model,a further
effectsthatare substantively
step in
unimportant.
theexploratory
process.
quality
Larger sample sizes are rarelybetter.Besides costs and time,problemsof homogeneity,
and so on, increase.I strongly
control,missingness,
oppose analysinga randomsubsampleof thedata; if
theyare available,use themall.
I agreewithTom Louis thata largesampleperhapsshouldnotrequireus to increasecomplexitybut
show me a statisticalprocedurethatdoes not have this characteristic
forfixedprecisionlevel, unless
thereis a truemodel.
I do not criticizecontinuousprobabilitymodels,but ratherlikelihoodsbased on theirdensities.A
theprobabilityof the observeddata,mustbe based on discreteprobabilitieseven
likelihoodfunction,
whenthemodelis continuous.
I make no claim to newnessin the paper; indeedI statedthe oppositein mypresentation.
It would
havebeen morehelpfulifPeterMcCullaghhad pointedoutwhatwas nottrueratherthanmakingstrong
unsubstantiated
accusations.Nor have I claimed any internalcontradictions
in the formalBayesian
approachbased on coherentprobabilities.The problemis ratherwhenwe tryto implementit in a contextforwhich it was not designed:exploratoryscientificinference.As illustratedby David Draper's
it is now in a stateof arbitrariness
comments,
rivallingfrequentist
theory.
As Pat Althampointsout,sparsecontingency
tables are of concernforinferencesusing asymptotic
such as the%2-distribution.
approximations,
However,I am arguingagainstthe latterin the contextof
work.Such tablesdo notpose a problemforlikelihoodinference.
exploratory
Overdispersionmodels are meant to handle counts with dependentevents,usually on the same
individualunit.Thatis notthecase herewherean assumptionof independently
reportedacquiredimmune
foroverdispersion.
eventsshouldbe reasonable.Hence,I wouldopposecorrecting
deficiency
syndrome
I certainlyagreewithJimHodges thatscientistscome witha specificpurpose(question),butto me
is to formulateit in termsof models givingtheprobabilityof the data and to
the role of a statistician
select the most appropriateset. Nowheredo I argue forkeepingonlyone model. In likelihood-based
model selection,theprocedureforselectingamongmodel functionsis identicalwiththatforchoosing
parametervalues
among parametervalues. Justas a likelihoodregioncontainsmodels withdifferent
supportedbythedata,so you can have an envelopeofplausiblemodelfunctions.
Dave Sprottis absolutelyrightabout the importanceof repeatabilityto scientificinference.My
worthattempting
a demonstration
concernhereis withdiscoveringsomething
ofrepeatability.
Resortto emotionalaccusations,as withone Bayesian-inclined
refereeand one frequentist-inclined
and no reasonedreplyis available.In
discussant,oftenmeansthatfundamental
principlesare threatened
currentstatistics,
thefundamental
divideis notbetweenfrequentists
and Bayesiansbutbetweenapplied
and journals. As one very
statisticiansand the theoreticiansdominatingmany teachinginstitutions
eminentstatistician
told my son afterthemeeting,applied statisticsis a peculiarprofessionwhereyou
haveto unlearnmuchofwhatyouweretaughtas a student.
References in the discussion
in generalized
linearmodels.Statist.Comput.,
likelihoodanalysisofoverdispersion
Aitkin,M. (1996) A generalmaximum
6, 251-262.
of the
(1997) The calibrationof P-values,posteriorBayes factorsand the AIC fromthe posteriordistribution
7, 253-272.
likelihood(withdiscussion).Statist.Comput.,
40
Discussion on thePaper by Lindsey
Barnard,
G. A. (1972) Theunityofstatistics.
J R. Statist.Soc. A, 135, 1-15.
in clinicaltrials.Statist.Med.,10, 871-890.
Bauer,P. (1991) Multipletesting
Bernardo,
J.M., Berger,J.,Dawid,A. P. and Smith,A. F. M. (eds) (1998) BayesianStatistics6. Oxford:OxfordUniversity
Press.To be published.
J.M. andSmith,A. F. M. (1994) BayesianTheory.
Wiley.
Bernardo,
Chichester:
C. (1995) Model uncertainty,
(withdiscussion).J R. Statist.Soc. A, 158,
Chatfield,
dataminingand statistical
inference
419-466.
inference
forBayes' problem.Statist.Probab.Lett.,36, 349Coolen,F P. A. (1998) Low structure
imprecisepredictive
357.
incidenceaccountingfor
De Angelis,D. and Gilks,W R. (1994) Estimating
acquiredimmunedeficiencysyndrome
reporting
delay.J R. Statist.Soc. A, 157,31-40.
A. P. (1974) The directuse of likelihoodforsignificance
Dempster,
testing.In Proc. Conf:FoundationalQuestionsin
P. Blaesild and G. Sihon), pp. 335-352. Aarhus:University
of
StatisticalInference(eds 0. Barndorff-Nielsen,
Aarhus.
7, 247-252.
(1997) The directuse oflikelihoodforsignificance
testing.
Statist.Comput.,
andpropagation
ofuncertainty
Draper,D. (1995) Assessment
(withdiscussion).J R. Statist.Soc. B, 57, 45-97.
forbreastcancer"byG Parmigiani.
In BayesianStatistics6
(1998a) Discussionof "Decisionmodelsin screening
Press.To be published.
(eds J.M. Bernardo,
J.Berger,A. P. DawidandA. F. M. Smith).Oxford:OxfordUniversity
calibration.
TechnicalReport.Statistics
Group,
(1998b) 3CV: Bayesianmodelchoicevia out-of-sample
predictive
ofMathematical
ofBath,Bath.
Department
Sciences,University
Bayesiannon-parametric
inference
withskewed
Draper,D., Cheal,R. and Sinclair,J.(1998) Fixingthebrokenbootstrap:
and long-taileddata. TechnicalReport.StatisticsGroup,Department
of Mathematical
Sciences,University
of Bath,
Bath.
familiesand theiruse in generalizedlinearregression.
Efron,B. (1986) Double exponential
J Am.Statist.Ass.,81, 709721.
de Finetti,
B. (1974) TheoiyofProbability,
vol. 1. Chichester:
Wiley.
(1975) TheoryofProbability,
vol. 2. Chichester:
Wiley.
Fisher,R. A. (1953a) The expansionofstatistics.
J R. Statist.Soc. A, 116, 1-6.
(1953b) Dispersionon a sphere.Proc.R. Soc. A, 217,295-305.
(1955) Statistical
methods
andscientific
induction.
J R. Statist.Soc. B, 17,69-78.
formultiple
regression.
Ann.Statist.,
22, 1947-1975.
Foster,D. P. andGeorge,E. I. (1994) Theriskinflation
criterion
vol. 3. New York:
Gatsonis,C., Hodges,J.S., Kass,R. E. andMcCulloch,R. E. (1997) Case StudiesinBayesianStatistics,
Springer.
withimplementation
usingpredictive
distributions,
Gelfand,A. E., Dey,D. K. andChang,H. (1992) Model determination
via sampling-based
methods(withdiscussion).In BayesianStatistics4 (eds J.M. Bernardo,
J.0. Berger,A. P. Dawid
Press.
andA. F M. Smith),pp. 147-167. Oxford:OxfordUniversity
forposteriorjudgements.In Proc. 10thInt. Congr.Logic, Methodologyand
Goldstein,M. (1997) Priorinferences
Kluwer.
Philosophy
ofScience(eds M. L. D. Chiara,K. Doets,D. MundiciandJ.vanBenthem).Dordrecht:
(1998) Bayeslinearanalysis.In EncyclopaediaofStatisticalSciences,updatevol. 3. New York:Wiley.To be published.
classrooms:thecase ofwordproblems.LearnngInstructn,
7, 293-307.
Greer,B. (1997) Modellingrealityinmathematics
in clinicalcross-over
treatment
trials.J Biopharm.
effects
Statist.,
8, 191-233.
Grieve,A. andSenn,S. J.(1998) Estimating
inference
and A(,) or Bayesiannonparametric
predictive
(withdisHill, B. M. (1988) De Finetti'stheorem,
induction,
3 (eds J.M. Bernardo,
M. H. DeGroot,D. V LindleyandA. F. M. Smith),pp. 211-241.
cussion).In BayesianStatistics
Press.
Oxford:OxfordUniversity
modelsforAn,:splitting
J R. Statist.Soc. B, 55,423-433.
(1993) Parametric
processesandmixtures.
In Modelingand
a sketchof a theoryof appliedstatistics.
Hodges,J. S. (1996) Statisticalpracticeas argumentation:
Prediction
Geisser(eds J.C. Lee, A. ZellnerandW 0. Johnson),
pp. 19-45. New York:Springer.
HonoringSeymour
linearmodels.J R. Statist.Soc. B, 58, 619-656.
Lee, Y. andNelder,J.A. (1996) Hierarchical
generalized
andAppliedMathematics.
a Review.Philadelphia:
Lindley,D. V (1972) BayesianStatistics,
SocietyforIndustrial
J.F (1997) Usingon-lineprocessdata to improvequality:challengesforstatisticians.
Int.Statist.Rev.,65,
MacGregor,
309-323.
2nd edn. Cambridge:
StatisticalPrinciplesfor PracticalApplications,
Mead, R. (1988) The Design of Experiments:
Press.
Cambridge
University
D. (1987) An extended
function.
Biometrika,
74, 221-232.
Nelder,J.A. andPregibon,
quasi-likelihood
Trialsin ClinicalResearch.Chichester:
Senn,S. (1993) Cross-over
Wiley.
ofpriorspastis notthesameas a trueprior.Br.Med.J, 314, 73.
(1997) Presentremembrance
intwo-stage
models.J R. Statist.Soc. B, 60, 455-471.
estimates
hierarchical
Shen,W andLouis,T. A. (1998) Triple-goal
andAitkin.Statist.Comput.,
7, 263-264.
Stone,M. (1997) DiscussionofpapersbyDempster
withS-Plus.New York:Springer.
Venables,W N. andRipley,B. D. (1997) ModernAppliedStatistics
inferencefor random
Walker,S. G., Damien,P., Laud, P. W and Smith,A. F M. (1999) Bayesiannonparametric
andrelatedfunctions
distributions
(withdiscussion).J R. Statist.Soc. B, 61, inthepress.
London:ChapmanandHall.
Probabilities.
Walley,P. (1991) Statistical
ReasoningwithImprecise