9.1.1 The Reasoning of Significance Tests
Significance Test - a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth we want to assess. The claim is a statement about a parameter, like the population proportion p or the population mean μ. We express the results of a significance test in terms of a probability that measures how well the data and the claim agree.
Statistical tests deal with claims about a population. Tests ask if sample data give good evidence against a claim. A test might say, “If we took many random samples and the claim were true, we would rarely get a result like this.” To get a numerical measure of how strong the sample evidence is, replace the vague term “rarely” by a probability.
Example – I’m a Great Free-Throw Shooter!
The logic of statistical tests
A basketball player claims to be an 80% free-throw shooter. Suppose he shoots 50 free throws and makes 32 of them. His sample proportion of made shots is p-hat = 32/50 = 0.64. What can we conclude about the player’s claim based on the sample data?
Our reasoning is based on asking what would happen if the player’s claim (p = 0.80) were true and we observed many samples of 50 free throws. We used Fathom software to simulate 400 sets of 50 shots assuming that the player is really an 80% shooter. The figure below shows a dotplot of the results. Each dot on the graph represents the proportion of made shots in one set of 50 attempts. For example, if the player makes 43/50 shots in one trial, the dot would be placed at p-hat = 0.86.
You can say how strong the evidence against the player’s claim is by giving the probability that he would make as few as 32 out of 50 free throws if he really makes 80% in the long run. Based on the simulation, our estimate of this probability is 3/400 = 0.0075. The observed statistic, p-hat = 0.64, is so unlikely if the actual parameter value is p = 0.80 that it gives convincing evidence that the player’s claim is not true.
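The simulation described above can be imitated in a few lines of code. The sketch below is not the Fathom simulation from the example: it uses Python's standard library, a hypothetical function name `simulate_low_tail`, and an arbitrary random seed, so its estimate will generally differ from 3/400. It also computes the exact binomial probability of making 32 or fewer shots out of 50 when p = 0.80, which is the quantity the simulation estimates.

```python
import random
from math import comb

def simulate_low_tail(n_sets=400, shots=50, p=0.80, observed=32, seed=1):
    """Estimate P(makes <= observed) by simulating n_sets sets of free throws."""
    rng = random.Random(seed)  # seed chosen arbitrarily so the run is repeatable
    low = 0
    for _ in range(n_sets):
        # Each shot is made with probability p, independently of the others.
        makes = sum(rng.random() < p for _ in range(shots))
        if makes <= observed:
            low += 1
    return low / n_sets

def exact_low_tail(shots=50, p=0.80, observed=32):
    """Exact probability P(X <= observed) for X ~ Binomial(shots, p)."""
    return sum(comb(shots, k) * p**k * (1 - p)**(shots - k)
               for k in range(observed + 1))

estimate = simulate_low_tail()
exact = exact_low_tail()
print(f"simulated estimate: {estimate:.4f}")
print(f"exact binomial probability: {exact:.4f}")
```

Both numbers should be small, which is the point of the example: a result this low would rarely occur if the player really were an 80% shooter.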
Be sure that you understand why this evidence is convincing. There are two possible explanations of the fact that our virtual player made only p-hat = 32/50 = 0.64 of his free throws:
1. The player’s claim is correct (p = 0.8), and by bad luck, a very unlikely outcome occurred.
2. The population proportion is actually less than 0.8, so the sample result is not an unlikely outcome.
Explanation 1 might be correct: the result of our random sample of 50 shots could be due to chance alone. But the probability that such a result would occur by chance is so small (less than 1 in 100) that we are quite confident that Explanation 2 is right.
Statistical tests use an elaborate vocabulary, but the basic idea is simple: an outcome that would rarely happen if a claim were true is good evidence that the claim is not true.
9.1.2 Stating Hypotheses
A significance test starts with a careful statement of the claims we want to compare. In our free-throw shooter example, the virtual player claims that his long-run proportion of made free throws is p = 0.80. This is the claim we seek evidence against. We call it the null hypothesis, abbreviated H0. Usually, the null hypothesis is a statement of “no difference.” For the free-throw shooter, no difference from what he claimed gives H0: p = 0.80. The claim we hope or suspect to be true instead of the null hypothesis is called the alternative hypothesis. We abbreviate the alternative hypothesis as Ha. In this case, we believe the player might be exaggerating, so our alternative hypothesis is Ha: p < 0.80.
Null Hypothesis - The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the evidence against the null hypothesis. Often the null hypothesis is a statement of “no difference.”
Alternative Hypothesis - The claim about the population that we are trying to find evidence for is the alternative hypothesis (Ha).
In the free-throw shooter example, our hypotheses are:
H0: p = 0.80
Ha: p < 0.80
where p is the long-run proportion of made free throws. The alternative hypothesis is one-sided because we are interested only in whether the player is overstating his free-throw shooting ability. Because Ha expresses the effect that we hope to find evidence for, it is sometimes easier to begin by stating Ha and then set up H0 as the statement that the hoped-for effect is not present. Here is an example in which the alternative hypothesis is two-sided.
Example – Studying Job Satisfaction
Stating hypotheses
Does the job satisfaction of assembly-line workers differ when their work is machine-paced rather than self-paced? One study chose 18 subjects at random from a company with over 200 workers who assembled electronic devices. Half of the workers were assigned at random to each of two groups. Both groups did similar assembly work, but one group was allowed to pace themselves while the other group used an assembly line that moved at a fixed pace. After two weeks, all the workers took a test of job satisfaction. Then they switched work setups and took the test again after two more weeks. (This experiment uses a matched pairs design, which you learned about in Chapter 4.) The response variable is the difference in satisfaction scores, self-paced minus machine-paced.
(a) Describe the parameter of interest in this setting.
(b) State appropriate hypotheses for performing a significance test.
The hypotheses should express the hopes or suspicions we have before we see the data. It is cheating to look at the data first and then frame hypotheses to fit what the data show. For example, the data for the job satisfaction study showed that the workers were more satisfied with self-paced work. But you should not change the alternative hypothesis to Ha: μ > 0 after looking at the data. If you do not have a specific direction firmly in mind in advance, use a two-sided alternative.
One-sided alternative hypothesis - the alternative hypothesis is one-sided if it states that a parameter is larger than the null hypothesis value or if it states that the parameter is smaller than the null value.
Two-sided alternative hypothesis - the alternative hypothesis is two-sided if it states that the parameter is different from the null hypothesis value (it could be either larger or smaller).
In any significance test, the null hypothesis has the form H0: parameter = value. The alternative hypothesis has one of the forms Ha: parameter < value, Ha: parameter > value, or Ha: parameter ≠ value. To determine the correct form of Ha, read the problem carefully.
Hypotheses always refer to a population, not to a sample. Be sure to state H0 and Ha in terms of population parameters. It is never correct to write a hypothesis about a sample statistic, such as p-hat or x-bar.
CHECK YOUR UNDERSTANDING
For each of the following settings, (a) describe the parameter of interest, and (b) state appropriate hypotheses for a
significance test.
1. According to the Web site sleepdeprivation.com, 85% of teens are getting less than eight hours of sleep a night.
Jannie wonders whether this result holds in her large high school. She asks an SRS of 100 students at the school how
much sleep they get on a typical night. In all, 75 of the responders said less than 8 hours.
2. As part of its 2010 census marketing campaign, the U.S. Census Bureau advertised “10 questions, 10 minutes—
that’s all it takes.” On the census form itself, we read, “The U.S. Census Bureau estimates that, for the average
household, this form will take about 10 minutes to complete, including the time for reviewing the instructions and
answers.” We suspect that the actual time it takes to complete the form may be longer than advertised.
9.1.3 Interpreting P-values
The idea of stating a null hypothesis that we want to find evidence against seems odd at first. It may help to think of a criminal trial. The defendant is “innocent until proven guilty.” That is, the null hypothesis is innocence and the prosecution must try to provide convincing evidence against this hypothesis. That’s exactly how statistical tests work, although in statistics we deal with evidence provided by data and use a probability to say how strong the evidence is.
The null hypothesis H0 states the claim that we are seeking evidence against. The probability that measures the strength of the evidence against a null hypothesis is called a P-value.
P-value - The probability, computed assuming H0 is true, that the statistic (such as p-hat or x-bar) would take a value as extreme as or more extreme than the one actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0 provided by the data.
Small P-values are evidence against H0 because they say that the observed result is unlikely to occur when H0 is true. Large P-values fail to give convincing evidence against H0 because they say that the observed result is likely to occur by chance when H0 is true. We’ll show you how to calculate P-values later. For now, let’s focus on interpreting them.
Example – I’m a Great Free-Throw Shooter
Interpreting a P-value
The P-value is the probability of getting a sample result at least as extreme as the one we did if H0 were true. Since the alternative hypothesis is Ha: p < 0.80, the sample results that count as “at least as extreme” are those with p-hat ≤ 0.64. In other words, the P-value is the conditional probability P(p-hat ≤ 0.64 | p = 0.80). Earlier, we used a simulation to estimate this probability as 3/400 = 0.0075. So if H0 is true, and the player makes 80% of his free throws in the long run, there’s less than a 1 in 100 chance that the player would make as few as 32 of 50 shots. The small probability gives strong evidence against H0 and in favor of the alternative Ha: p < 0.80.
The alternative hypothesis sets the direction that counts as evidence against H0. In the previous example, only values of p-hat that are much less than 0.80 count as evidence against the null hypothesis because the alternative is one-sided on the low side. If the alternative is two-sided, both directions count.
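To make the role of the alternative concrete, here is a minimal sketch of how the direction of Ha changes which tail area counts toward the P-value. It assumes a standardized test statistic z whose distribution under H0 is standard Normal; that model, the function name `p_value_from_z`, and the value z = -2.0 are illustrative assumptions, not part of the examples in these notes.

```python
from statistics import NormalDist

def p_value_from_z(z, alternative):
    """P-value for a standardized statistic z under a standard Normal model.

    alternative must be "less", "greater", or "two-sided".
    """
    std_normal = NormalDist()
    if alternative == "less":
        # Only results at or below the observed value count as extreme.
        return std_normal.cdf(z)
    if alternative == "greater":
        # Only results at or above the observed value count as extreme.
        return 1 - std_normal.cdf(z)
    if alternative == "two-sided":
        # Results at least this far from the null value in either direction count.
        return 2 * (1 - std_normal.cdf(abs(z)))
    raise ValueError("alternative must be 'less', 'greater', or 'two-sided'")

z = -2.0  # hypothetical statistic, two standard deviations below the null value
for alt in ("less", "greater", "two-sided"):
    print(alt, round(p_value_from_z(z, alt), 4))
```

With a statistic on the low side, the "less" alternative gives a small P-value, the "greater" alternative gives a large one, and the two-sided alternative gives twice the smaller tail area.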
Example – Studying Job Satisfaction
Interpreting a P-value
For the job satisfaction study described above, the hypotheses are:
H0: μ = 0
Ha: μ ≠ 0
where μ is the mean difference in job satisfaction scores (self-paced − machine-paced) in the population of assembly-line workers at the company. Data from the 18 workers gave x-bar = 17 and sx = 60. That is, these workers rated the self-paced environment, on average, 17 points higher.
Researchers performed a significance test using the sample data and obtained a P-value of 0.2302.
(a) Explain what it means for the null hypothesis to be true in this setting.
(b) Interpret the P-value in context.
(c) Do the data provide convincing evidence against the null hypothesis? Explain.
The conclusion of the job satisfaction study is not that H0 is true. The study looked for evidence against H0: µ = 0 and
failed to find strong evidence. That is all we can say. Failing to find evidence against H0 means only that the data
are consistent with H0, not that we have clear evidence that H0 is true.
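The notes do not say how the P-value of 0.2302 was computed. As a rough check, the sketch below standardizes the sample mean difference and uses a standard Normal reference distribution, which is an assumption made here for illustration. Under that assumption the two-sided P-value comes out near 0.23, close to the reported value; a t distribution with 17 degrees of freedom would give a somewhat larger P-value.

```python
from math import sqrt
from statistics import NormalDist

x_bar = 17   # sample mean difference (self-paced minus machine-paced), from the example
s_x = 60     # sample standard deviation of the differences, from the example
n = 18       # number of workers in the study
mu_0 = 0     # value of the parameter under H0

standard_error = s_x / sqrt(n)
test_statistic = (x_bar - mu_0) / standard_error

# Assumption: standard Normal reference distribution (not stated in the notes).
# Two-sided alternative, so both tails count.
p_value = 2 * (1 - NormalDist().cdf(abs(test_statistic)))

print(f"test statistic: {test_statistic:.3f}")
print(f"two-sided P-value (Normal approximation): {p_value:.4f}")
```

Either way, a difference of 17 points is only a little more than one standard error from 0, so a result this far from 0 is not surprising when H0 is true.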
9.1.4 Statistical Significance
The final step in performing a significance test is to draw a conclusion about the competing claims you were testing. We will make one of two decisions based on the strength of the evidence against the null hypothesis (and in favor of the alternative hypothesis): reject H0 or fail to reject H0. If our sample result is too unlikely to have happened by chance assuming H0 is true, then we’ll reject H0. Otherwise, we will fail to reject H0.
This wording may seem unusual at first, but it’s consistent with what happens in a criminal trial. Once the jury has weighed the evidence against the null hypothesis of innocence, they return one of two verdicts: “guilty” (reject H0) or “not guilty” (fail to reject H0). A not-guilty verdict doesn’t guarantee that the defendant is innocent, just that there’s not enough evidence of guilt. Likewise, a fail-to-reject H0 decision in a significance test doesn’t mean that H0 is true. For that reason, you should never “accept H0” or use language implying that you believe H0 is true.
Example – Free Throws and Job Satisfaction
Drawing conclusions
In the free-throw shooter example, the estimated P-value of 0.0075 is strong evidence against the null hypothesis H0: p = 0.80. For that reason, we would reject H0 in favor of the alternative Ha: p < 0.80. It appears that the virtual player makes fewer than 80% of his free throws.
For the job satisfaction experiment, however, the P-value of 0.2302 is not convincing evidence against H0: μ = 0. We therefore fail to reject H0. Researchers cannot conclude that job satisfaction differs based on work conditions for the population of assembly-line workers at the company.
In a nutshell, our conclusion in a significance test comes down to:
P-value small → reject H0 → conclude Ha (in context)
P-value large → fail to reject H0 → cannot conclude Ha (in context)
There is no rule for how small a P-value we should require in order to reject H0; it’s a matter of judgment and depends on the specific circumstances. But we can compare the P-value with a fixed value that we regard as decisive, called the significance level. We write it as α, the Greek letter alpha. If we choose α = 0.05, we are requiring that the data give evidence against H0 so strong that it would happen less than 5% of the time just by chance when H0 is true. If we choose α = 0.01, we are insisting on stronger evidence against the null hypothesis, a result that would occur less often than 1 in every 100 times in the long run if H0 is true. When our P-value is less than the chosen α, we say that the result is statistically significant.
Statistically Significant - If the P-value is smaller than alpha, we say that the data are statistically significant at level α. In that case, we reject the null hypothesis H0 and conclude that there is convincing evidence in favor of the alternative hypothesis Ha.
“Significant” in the statistical sense does not necessarily mean “important.” It means simply “not likely to happen just by chance.” The significance level α makes “not likely” more exact. Significance at level 0.01 is often expressed by the statement, “The results were significant (P < 0.01).” Here P stands for the P-value. The actual P-value is more informative than a statement of significance because it allows us to assess significance at any level we choose. For example, a result with P = 0.03 is significant at the α = 0.05 level but is not significant at the α = 0.01 level. When we use a fixed significance level to draw a conclusion in a statistical test,
P-value < α → reject H0 → conclude Ha (in context)
P-value ≥ α → fail to reject H0 → cannot conclude Ha (in context)
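The fixed-level decision (reject H0 when the P-value is smaller than α, as in the definition of statistical significance above) is simple enough to write as a small function. This is only a sketch; the function name `conclusion` is made up for illustration. The P-values are the ones quoted in these notes: 0.0075 for the free-throw shooter, 0.2302 for the job satisfaction study, and the P = 0.03 example.

```python
def conclusion(p_value, alpha):
    """Return the decision about H0 for a fixed significance level alpha."""
    if p_value < alpha:
        return "reject H0"          # result is statistically significant at level alpha
    return "fail to reject H0"      # not enough evidence against H0 at level alpha

for p_value in (0.0075, 0.2302, 0.03):
    for alpha in (0.05, 0.01):
        print(f"P = {p_value}, alpha = {alpha}: {conclusion(p_value, alpha)}")
```

The output for P = 0.03 shows why reporting the P-value itself is more informative than reporting only the decision: the same data lead to different decisions at α = 0.05 and α = 0.01.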
Example – Better Batteries
Statistical Significance
A company has developed a new deluxe AAA battery that is supposed to last longer than its regular AAA battery. However, these new batteries are more expensive to produce, so the company would like to be convinced that they really do last longer. Based on years of experience, the company knows that its regular AAA batteries last for 30 hours of continuous use, on average. The company selects an SRS of 15 new batteries and uses them continuously until they are completely drained. A significance test is performed using the hypotheses
H0: μ = 30
Ha: μ > 30
where μ is the true mean lifetime of the new deluxe AAA batteries. The resulting P-value is 0.0276.
What conclusion would you make for each of the following significance levels? Justify your answer.
AP EXAM TIP: The conclusion to a significance test should always include three components: (1) an explicit comparison of the P-value to a stated significance level OR an interpretation of the P-value as a conditional probability, (2) a decision about the null hypothesis: reject or fail to reject H0, and (3) an explanation of what the decision means in context.
In practice, the most commonly used significance level is α = 0.05. Sometimes it may be preferable to choose α = 0.01 or α = 0.10, for reasons we will discuss shortly. Warning: if you are going to draw a conclusion based on statistical significance, then the significance level α should be stated before the data are produced. Otherwise, a deceptive user of statistics might set an α level after the data have been analyzed in an obvious attempt to manipulate the conclusion. This is just as inappropriate as choosing an alternative hypothesis to be one-sided in a particular direction after looking at the data.
Why choose a significance level at all? The purpose of a significance test is to give a clear statement of the strength of evidence provided by the data against the null hypothesis. The P-value does this. But how small a P-value is convincing evidence against the null hypothesis? This depends mainly on two circumstances:
• How plausible is H0? If H0 represents an assumption that the people you must convince have believed for years, strong evidence (small P-value) will be needed to persuade them.
• What are the consequences of rejecting H0? If rejecting H0 in favor of Ha means making an expensive change of some kind, you need strong evidence that the change will be beneficial.
These criteria are a bit subjective. Different people will insist on different levels of significance. Giving the P-value allows each of us to decide individually if the evidence is sufficiently strong.
Users of statistics have often emphasized standard significance levels such as 10%, 5%, and 1%. For example, courts have tended to accept 5% as a standard in discrimination cases. This emphasis reflects the time when tables of critical values rather than technology dominated statistical practice. The 5% level (α = 0.05) is particularly common (probably due to R. A. Fisher).
There is no practical distinction between the P-values 0.049 and 0.051. However, if we use an α = 0.05 significance level, the former value will lead us to reject H0 while the latter value will lead us to not reject H0.
Beginning users of statistical tests generally find it easier to compare a P-value to a significance level than to interpret the P-value correctly in context. For that reason, we will include stating a significance level as a required part of every significance test. We’ll also ask you to explain what a P-value means in a variety of settings.
9.1.5 Type I and Type II Errors
When we draw a conclusion from a significance test, we hope our conclusion will be correct. But sometimes it will be wrong. There are two types of mistakes we can make. We can reject the null hypothesis when it’s actually true, known as a Type I error, or we can fail to reject a false null hypothesis, which is a Type II error.
Type I Error – If we reject H0 when H0 is true, we have committed a Type I error.
Type II Error – If we fail to reject H0 when H0 is false, we have committed a Type II error.
If H0 is true, our conclusion is correct if we fail to reject H0, but it is a Type I error if we reject H0. If Ha is true, our conclusion is either correct or a Type II error. Only one error is possible at a time.
Example – Perfect Potatoes
Type I and Type II Errors
A potato chip producer and its main supplier agree that each shipment of potatoes must meet certain quality standards. If the producer determines that more than 8% of the potatoes in the shipment have “blemishes,” the truck will be sent away to get another load of potatoes from the supplier. Otherwise, the entire truckload will be used to make potato chips. To make the decision, a supervisor will inspect a random sample of potatoes from the shipment. The producer will then perform a significance test using the hypotheses
H0: p = 0.08
Ha: p > 0.08
where p is the actual proportion of potatoes with blemishes in a given truckload.
Describe a Type I and a Type II error in this setting, and explain the consequences of each.
CHECK YOUR UNDERSTANDING
A company has developed a new deluxe AAA battery that is supposed to last longer than its regular AAA
battery. However, these new batteries are more expensive to produce, so the company would like to be convinced
that they really do last longer. Based on years of experience, the company knows that its regular AAA batteries last
for 30 hours of continuous use, on average. The company selects an SRS of 15 new batteries and uses them
continuously until they are completely drained. A significance test is performed using the hypotheses
H0: µ = 30
Ha: µ > 30
where µ is the true mean lifetime of the new deluxe AAA batteries. The resulting P-value is 0.0276.
1. Describe a Type I error in this setting.
2. Describe a Type II error in this setting.
3. Which type of error is more serious in this case? Justify your answer.
Error Probabilities
We can assess the performance of a significance test by looking at the probabilities of the two types of error. That’s because statistical inference is based on asking, “What would happen if I did this many times?” We cannot (without inspecting the whole truckload) guarantee that good shipments of potatoes will never be rejected and bad shipments will never be accepted. But we can think about our chances of making each of these mistakes.
Example – Perfect Potatoes
Type I Error probability
For the truckload of potatoes in the previous example, we were testing
H0: p = 0.08
Ha: p > 0.08
where p is the actual proportion of potatoes with blemishes. Suppose that the potato-chip producer decides to carry out this test based on a random sample of 500 potatoes using a 5% significance level (α = 0.05). A Type I error is to reject H0 when H0 is actually true. If our sample results in a value of p-hat that is much larger than 0.08, we will reject H0. How large would p-hat need to be? The 5% significance level tells us to count results that could happen less than 5% of the time by chance if H0 is true as evidence that H0 is false. Assuming H0: p = 0.08 is true, the sampling distribution of p-hat will have
Shape: Approximately Normal because 500(0.08) = 40 and 500(0.92) = 460 are both at least 10.
The figure below shows the Normal curve that approximates this sampling distribution. The shaded area in the right tail of the figure is 5%. Values of p-hat to the right of the green line will cause us to reject H0 even though H0 is true. This will happen in 5% of all possible samples. That is, the probability of making a Type I error is 0.05.
The probability of a Type I error is the probability of rejecting H0 when it is really true. As the previous example
showed, this is exactly the significance level of the test.
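As a sketch of where the rejection cutoff in the potato example comes from, the code below uses the standard formula for the standard deviation of a sample proportion, sqrt(p(1 − p)/n), together with the Normal approximation from the example, and finds the value of p-hat with 5% of the curve to its right. It then checks the Type I error rate against the exact binomial distribution of the count of blemished potatoes. The variable names are illustrative, and the exact check is an addition, not something taken from the notes.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p0, alpha = 500, 0.08, 0.05

# Normal approximation to the sampling distribution of p-hat when H0: p = 0.08 is true.
sd = sqrt(p0 * (1 - p0) / n)
cutoff = NormalDist(mu=p0, sigma=sd).inv_cdf(1 - alpha)  # reject H0 when p-hat > cutoff

# Exact check: the smallest count of blemished potatoes whose proportion exceeds the cutoff.
min_count = int(cutoff * n) + 1
prob_below = sum(comb(n, k) * p0**k * (1 - p0)**(n - k) for k in range(min_count))
exact_type_1 = 1 - prob_below  # P(count >= min_count) when p = 0.08

print(f"standard deviation of p-hat under H0: {sd:.4f}")
print(f"cutoff for p-hat: {cutoff:.4f}")
print(f"exact Type I error probability: {exact_type_1:.4f}")
```

The exact probability is close to, but not identical to, 0.05, because the Normal curve is only an approximation and the count of blemished potatoes can take only whole-number values.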
What about Type II errors? A significance test makes a Type II error when it fails to reject a null hypothesis that really is false. There are many values of the parameter that satisfy the alternative hypothesis, so we concentrate on one value. A high probability of a Type II error for a particular alternative means that the test is not sensitive enough to usually detect that alternative. In the significance test setting, it is more common to report the probability that a test does reject H0 when an alternative is true. This probability is called the power of the test against that specific alternative. The higher this probability is, the more sensitive the test is.
Power - The power of a test against a specific alternative is the probability that the test will reject H0 at a chosen significance level α when the specified alternative value of the parameter is true.
As the following example illustrates, Type II error and power are closely linked.
Example – Perfect Potatoes
Type II error and power
The potato-chip producer wonders whether the significance test of H0: p = 0.08 versus Ha: p > 0.08 based on a random sample of 500 potatoes has enough power to detect a shipment with, say, 11% blemished potatoes. In this case, a particular Type II error is to fail to reject H0: p = 0.08 when p = 0.11. The figure below shows two sampling distributions of p-hat, one when p = 0.08 and the other when p = 0.11.
Earlier, we decided to reject H0 if our sample yielded a value of p-hat to the right of the green line. That decision was based on using a significance level (Type I error probability) of α = 0.05. Now look at the sampling distribution for p = 0.11. The shaded area represents the probability of correctly rejecting H0 when p = 0.11. That is, the power of this test to detect p = 0.11 is about 0.75. In other words, the potato-chip producer has roughly a 3-in-4 chance of rejecting a truckload with 11% blemished potatoes based on a random sample of 500 potatoes from the shipment.
We would fail to reject H0 if the sample proportion p-hat falls to the left of the green line. The white area shows the probability of failing to reject H0 when H0 is false. This is the probability of a Type II error. The potato-chip producer has about a 1-in-4 chance of failing to send away a shipment with 11% blemished potatoes.
The potato-chip producer decided that it is important to detect when a shipment contains 11% blemished
potatoes. Suppose the company had decided instead that detecting a shipment with 10% blemished potatoes is
important. Obviously, our power calculations would have been different. (Can you tell whether the power of the test
to detect p = 0.10 would be higher or lower than for p = 0.11?) Remember: the power of a test gives the probability
of detecting a specific alternative value of the parameter. The choice of that alternative value is usually made by
someone with a vested interest in the situation (like the potato-chip producer).
After reading the example, you might be wondering whether 0.75 is a high power or a low power. That depends on
how certain the potato-chip producer wants to be to detect a shipment with 11% blemished potatoes. The power of a
test against a specific alternative value of the parameter (like p = 0.11) is a number between 0 and 1. A power close
to 0 means the test has almost no chance of detecting that H0 is false. A power near 1 means the test is very likely to
reject H0 in favor of Ha when H0 is false.
The significance level of a test is the probability of reaching the wrong conclusion when the null hypothesis is
true. The power of a test to detect a specific alternative is the probability of reaching the right conclusion when that
alternative is true. We can just as easily describe the test by giving the probability of making a Type II
error (sometimes called β).
Calculating a Type II error probability or power by hand is possible but unpleasant. It’s better to let technology do the
work for you.
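As one way of letting technology do the work, the sketch below approximates the power of the potato test with Normal curves for the sampling distribution of p-hat under both the null value and the alternative value. The function name `power_one_sided` and the use of the Normal approximation for both curves are assumptions made for illustration. For p = 0.11 it gives a power of roughly 0.76, in line with the "about 0.75" read from the figure in the example.

```python
from math import sqrt
from statistics import NormalDist

def power_one_sided(p_alt, p0=0.08, n=500, alpha=0.05):
    """Approximate power of the test of H0: p = p0 versus Ha: p > p0 when p = p_alt."""
    # Rejection cutoff, computed from the sampling distribution under H0.
    sd_null = sqrt(p0 * (1 - p0) / n)
    cutoff = NormalDist(p0, sd_null).inv_cdf(1 - alpha)
    # Probability that p-hat lands above the cutoff when the true proportion is p_alt.
    sd_alt = sqrt(p_alt * (1 - p_alt) / n)
    return 1 - NormalDist(p_alt, sd_alt).cdf(cutoff)

power = power_one_sided(0.11)
beta = 1 - power  # probability of a Type II error when p = 0.11

print(f"power against p = 0.11: {power:.3f}")
print(f"Type II error probability: {beta:.3f}")
```

Calling `power_one_sided` with a different alternative, a different sample size, or a different α shows how each choice changes the power. When the alternative value equals the null value, the function returns α, the Type I error probability.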
9.1.6 Planning Studies: The Power of a Statistical Test
How large a sample should we take when we plan to carry out a significance test? The answer depends on what alternative values of the parameter are important to detect. For instance, the potato-chip producer wants to have a good chance of rejecting H0: p = 0.08 in favor of Ha: p > 0.08 if the true proportion of blemished potatoes in a shipment is p = 0.11. In the last example, we found that the power of the test to detect p = 0.11 using a random sample of size n = 500 and a significance level of α = 0.05 is about 0.75.
Here are the questions we must answer to decide how many observations we need:
1. Significance level. How much protection do we want against a Type I error, that is, getting a significant result from our sample when H0 is actually true? By using α = 0.05, the potato-chip producer has a 5% chance of making a Type I error.
2. Practical importance. How large a difference between the hypothesized parameter value and the actual parameter value is important in practice? The chip producer feels that it’s important to detect a shipment with 11% blemished potatoes, a difference of 0.03 (3 percentage points) from the hypothesized value of p = 0.08.
3. Power. How confident do we want to be that our study will detect a difference of the size we think is important?
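Once the three questions are answered, the sample size can be found by trial: increase n until the power reaches the target. The sketch below does this for the potato setting (α = 0.05, H0: p = 0.08, alternative p = 0.11) using a Normal approximation. The target power of 0.9, the function names, and the search limits are assumptions chosen for illustration; the notes do not say what power the chip producer wants.

```python
from math import sqrt
from statistics import NormalDist

def power_for_n(n, p_alt=0.11, p0=0.08, alpha=0.05):
    """Approximate power of H0: p = p0 versus Ha: p > p0 at p = p_alt with sample size n."""
    sd_null = sqrt(p0 * (1 - p0) / n)
    cutoff = NormalDist(p0, sd_null).inv_cdf(1 - alpha)  # reject H0 when p-hat > cutoff
    sd_alt = sqrt(p_alt * (1 - p_alt) / n)
    return 1 - NormalDist(p_alt, sd_alt).cdf(cutoff)

def smallest_n(target_power, start=10, limit=100_000):
    """Smallest sample size in [start, limit] whose approximate power reaches the target."""
    for n in range(start, limit + 1):
        if power_for_n(n) >= target_power:
            return n
    raise ValueError("target power not reached within the search limit")

target = 0.9  # hypothetical target power, chosen for illustration
needed = smallest_n(target)
print(f"power with n = 500: {power_for_n(500):.3f}")
print(f"smallest n with power at least {target}: {needed}")
```

Under these assumptions, the required sample is noticeably larger than the 500 potatoes used in the example, which shows how demanding a high power target can be when the difference to detect is small.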
Example – Developing Stronger Bones
Planning a study
Can a six-month exercise program increase the total body bone mineral content (TBBMC) of young women? A team of researchers is planning a study to examine this question. The researchers would like to perform a test of
H0: μ = 0
Ha: μ > 0
where μ is the true mean percent change in TBBMC due to the exercise program. To decide how many subjects they should include in their study, researchers begin by answering the three questions above.
1. Significance level. The researchers decide that α = 0.05 gives enough protection against declaring that the exercise program increases bone mineral content when it really doesn’t (a Type I error).
2. Practical importance. A mean increase in TBBMC of 1% would be considered important.
3. Power. The researchers want probability at least 0.9 that a test at the chosen significance level will reject the null hypothesis H0: μ = 0 when the truth is μ = 1.
Our best advice for maximizing the power of a test is to choose as high an α level (Type I error probability) as you are
willing to risk and as large a sample size as you can afford.