9.1.1 The Reasoning of Significance Tests

Significance Test - a formal procedure for comparing observed data with a claim (also called a hypothesis) whose truth we want to assess. The claim is a statement about a parameter, like the population proportion p or the population mean μ. We express the results of a significance test in terms of a probability that measures how well the data and the claim agree.

Statistical tests deal with claims about a population. Tests ask if sample data give good evidence against a claim. A test might say, “If we took many random samples and the claim were true, we would rarely get a result like this.” To get a numerical measure of how strong the sample evidence is, replace the vague term “rarely” by a probability.

Example: I’m a Great Free-Throw Shooter!
The logic of statistical tests

A basketball player claims to be an 80% free-throw shooter. Suppose he shoots 50 free throws and makes 32 of them. His sample proportion of made shots is p-hat = 32/50 = 0.64. What can we conclude about the player’s claim based on the sample data?

Our reasoning is based on asking what would happen if the player’s claim (p = 0.80) were true and we observed many samples of 50 free throws. We used Fathom software to simulate 400 sets of 50 shots assuming that the player is really an 80% shooter. The figure below shows a dotplot of the results. Each dot on the graph represents the proportion of made shots in one set of 50 attempts. For example, if the player makes 43/50 shots in one trial, the dot would be placed at p-hat = 0.86.

You can say how strong the evidence against the player’s claim is by giving the probability that he would make as few as 32 out of 50 free throws if he really makes 80% in the long run. Based on the simulation, our estimate of this probability is 3/400 = 0.0075. The observed statistic, p-hat = 0.64, is so unlikely if the actual parameter value is p = 0.80 that it gives convincing evidence that the player’s claim is not true.

Be sure that you understand why this evidence is convincing. There are two possible explanations of the fact that our virtual player made only 64% of his free throws:
1. The player’s claim is correct (p = 0.8), and by bad luck, a very unlikely outcome occurred.
2. The population proportion is actually less than 0.8, so the sample result is not an unlikely outcome.
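The simulation reasoning in the free-throw example can be sketched in code. The text used Fathom; the Python below is only a stand-in, and the function names, the seed, and the exact binomial cross-check are illustrative assumptions rather than part of the original example. A different seed will give a somewhat different estimate, just as a fresh Fathom run would.

```python
import random
from math import comb


def simulate_low_tail(n_shots=50, p=0.80, observed_makes=32, n_sets=400, seed=1):
    """Estimate P(makes <= observed_makes) by simulating n_sets sets of shots,
    assuming the player's claim (long-run proportion p) is true."""
    rng = random.Random(seed)  # fixed seed (an arbitrary choice) for repeatability
    low_count = 0
    for _ in range(n_sets):
        # Each shot is made with probability p, independently of the others.
        makes = sum(rng.random() < p for _ in range(n_shots))
        if makes <= observed_makes:
            low_count += 1
    return low_count / n_sets


def exact_low_tail(n_shots=50, p=0.80, observed_makes=32):
    """Exact binomial probability of making observed_makes or fewer shots."""
    return sum(
        comb(n_shots, k) * p**k * (1 - p) ** (n_shots - k)
        for k in range(observed_makes + 1)
    )


estimate = simulate_low_tail()
exact = exact_low_tail()
print(f"simulated estimate: {estimate:.4f}")
print(f"exact binomial probability: {exact:.4f}")
```

Both numbers should be well under 0.01, in line with the 3/400 reported from the simulation in the text.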
Explanation 1 might be correct: the result of our random sample of 50 shots could be due to chance alone. But the probability that such a result would occur by chance is so small (less than 1 in 100) that we are quite confident that Explanation 2 is right.

Statistical tests use an elaborate vocabulary, but the basic idea is simple: an outcome that would rarely happen if a claim were true is good evidence that the claim is not true.

9.1.2 Stating Hypotheses

A significance test starts with a careful statement of the claims we want to compare. In our free-throw shooter example, the virtual player claims that his long-run proportion of made free throws is p = 0.80. This is the claim we seek evidence against. We call it the null hypothesis, abbreviated H0. Usually, the null hypothesis is a statement of “no difference.” For the free-throw shooter, no difference from what he claimed gives H0: p = 0.80. The claim we hope or suspect to be true instead of the null hypothesis is called the alternative hypothesis. We abbreviate the alternative hypothesis as Ha. In this case, we believe the player might be exaggerating, so our alternative hypothesis is Ha: p < 0.80.

Null Hypothesis - The claim tested by a statistical test is called the null hypothesis (H0). The test is designed to assess the strength of the evidence against the null hypothesis. Often the null hypothesis is a statement of “no difference.”

Alternative Hypothesis - The claim about the population that we are trying to find evidence for is the alternative hypothesis (Ha).

In the free-throw shooter example, our hypotheses are:
H0: p = 0.80
Ha: p < 0.80
where p is the long-run proportion of made free throws. The alternative hypothesis is one-sided because we are interested only in whether the player is overstating his free-throw shooting ability. Because Ha expresses the effect that we hope to find evidence for, it is sometimes easier to begin by stating Ha and then set up H0 as the statement that the hoped-for effect is not present. Here is an example in which the alternative hypothesis is two-sided.
Example: Studying Job Satisfaction
Stating hypotheses

Does the job satisfaction of assembly-line workers differ when their work is machine-paced rather than self-paced? One study chose 18 subjects at random from a company with over 200 workers who assembled electronic devices. Half of the workers were assigned at random to each of two groups. Both groups did similar assembly work, but one group was allowed to pace themselves while the other group used an assembly line that moved at a fixed pace. After two weeks, all the workers took a test of job satisfaction. Then they switched work setups and took the test again after two more weeks. (This experiment uses a matched pairs design, which you learned about in Chapter 4.) The response variable is the difference in satisfaction scores, self-paced minus machine-paced.
(a) Describe the parameter of interest in this setting.
(b) State appropriate hypotheses for performing a significance test.

The hypotheses should express the hopes or suspicions we have before we see the data. It is cheating to look at the data first and then frame hypotheses to fit what the data show. For example, the data for the job satisfaction study showed that the workers were more satisfied with self-paced work. But you should not change the alternative hypothesis to Ha: μ > 0 after looking at the data. If you do not have a specific direction firmly in mind in advance, use a two-sided alternative.

One-sided alternative hypothesis - the alternative hypothesis is one-sided if it states that a parameter is larger than the null hypothesis value or if it states that the parameter is smaller than the null value.

Two-sided alternative hypothesis - the alternative hypothesis is two-sided if it states that the parameter is different from the null hypothesis value (it could be either larger or smaller).

In any significance test, the null hypothesis has the form H0: parameter = value. The alternative hypothesis has one of the forms Ha: parameter < value, Ha: parameter > value, or Ha: parameter ≠ value. To determine the correct form of Ha, read the problem carefully.

Hypotheses always refer to a population, not to a sample. Be sure to state H0 and Ha in terms of population parameters. It is never correct to write a hypothesis about a sample statistic, such as p-hat or x-bar.
CHECK YOUR UNDERSTANDING

For each of the following settings, (a) describe the parameter of interest, and (b) state appropriate hypotheses for a significance test.
1. According to the Web site sleepdeprivation.com, 85% of teens are getting less than eight hours of sleep a night. Jannie wonders whether this result holds in her large high school. She asks an SRS of 100 students at the school how much sleep they get on a typical night. In all, 75 of the responders said less than 8 hours.
2. As part of its 2010 census marketing campaign, the U.S. Census Bureau advertised “10 questions, 10 minutes - that’s all it takes.” On the census form itself, we read, “The U.S. Census Bureau estimates that, for the average household, this form will take about 10 minutes to complete, including the time for reviewing the instructions and answers.” We suspect that the actual time it takes to complete the form may be longer than advertised.

9.1.3 Interpreting P-values

The idea of stating a null hypothesis that we want to find evidence against seems odd at first. It may help to think of a criminal trial. The defendant is “innocent until proven guilty.” That is, the null hypothesis is innocence and the prosecution must try to provide convincing evidence against this hypothesis. That’s exactly how statistical tests work, although in statistics we deal with evidence provided by data and use a probability to say how strong the evidence is.

The null hypothesis H0 states the claim that we are seeking evidence against. The probability that measures the strength of the evidence against a null hypothesis is called a P-value.

P-value - The probability, computed assuming H0 is true, that the statistic (such as p-hat or x-bar) would take a value as extreme as or more extreme than the one actually observed is called the P-value of the test. The smaller the P-value, the stronger the evidence against H0 provided by the data.
Small P-values are evidence against H0 because they say that the observed result is unlikely to occur when H0 is true. Large P-values fail to give convincing evidence against H0 because they say that the observed result is likely to occur by chance when H0 is true. We’ll show you how to calculate P-values later. For now, let’s focus on interpreting them.

Example: I’m a Great Free-Throw Shooter
Interpreting a P-value

The P-value is the probability of getting a sample result at least as extreme as the one we did if H0 were true. Since the alternative hypothesis is Ha: p < 0.80, the sample results that count as “at least as extreme” are those with p-hat ≤ 0.64. In other words, the P-value is the conditional probability P(p-hat ≤ 0.64 | p = 0.80). Earlier, we used a simulation to estimate this probability as 3/400 = 0.0075. So if H0 is true, and the player makes 80% of his free throws in the long run, there’s less than a 1 in 100 chance that the player would make as few as 32 of 50 shots. The small probability gives strong evidence against H0 and in favor of the alternative Ha: p < 0.80.

The alternative hypothesis sets the direction that counts as evidence against H0. In the previous example, only values of p-hat that are much less than 0.80 count as evidence against the null hypothesis because the alternative is one-sided on the low side. If the alternative is two-sided, both directions count.

Example: Studying Job Satisfaction
Interpreting a P-value

Does the job satisfaction of assembly-line workers differ when their work is machine-paced rather than self-paced? One study chose 18 subjects at random from a company with over 200 workers who assembled electronic devices. Half of the workers were assigned at random to each of two groups. Both groups did similar assembly work, but one group was allowed to pace themselves while the other group used an assembly line that moved at a fixed pace. After two weeks, all the workers took a test of job satisfaction. Then they switched work setups and took the test again after two more weeks. (This experiment uses a matched pairs design, which you learned about in Chapter 4.) The response variable is the difference in satisfaction scores, self-paced minus machine-paced.
For the job satisfaction study described above, the hypotheses are:
H0: μ = 0
Ha: μ ≠ 0
where μ is the mean difference in job satisfaction scores (self-paced − machine-paced) in the population of assembly-line workers at the company. Data from the 18 workers gave x-bar = 17 and sx = 60. That is, these workers rated the self-paced environment, on average, 17 points higher.

Researchers performed a significance test using the sample data and obtained a P-value of 0.2302.
(a) Explain what it means for the null hypothesis to be true in this setting.
(b) Interpret the P-value in context.
(c) Do the data provide convincing evidence against the null hypothesis? Explain.

The conclusion of the job satisfaction study is not that H0 is true. The study looked for evidence against H0: μ = 0 and failed to find strong evidence. That is all we can say. Failing to find evidence against H0 means only that the data are consistent with H0, not that we have clear evidence that H0 is true.

9.1.4 Statistical Significance

The final step in performing a significance test is to draw a conclusion about the competing claims you were testing. We will make one of two decisions based on the strength of the evidence against the null hypothesis (and in favor of the alternative hypothesis): reject H0 or fail to reject H0. If our sample result is too unlikely to have happened by chance assuming H0 is true, then we’ll reject H0. Otherwise, we will fail to reject H0.

This wording may seem unusual at first, but it’s consistent with what happens in a criminal trial. Once the jury has weighed the evidence against the null hypothesis of innocence, they return one of two verdicts: “guilty” (reject H0) or “not guilty” (fail to reject H0). A not-guilty verdict doesn’t guarantee that the defendant is innocent, just that there’s not enough evidence of guilt. Likewise, a fail-to-reject H0 decision in a significance test doesn’t mean that H0 is true. For that reason, you should never “accept H0” or use language implying that you believe H0 is true.
Example: Free Throws and Job Satisfaction
Drawing conclusions

In the free-throw shooter example, the estimated P-value of 0.0075 is strong evidence against the null hypothesis H0: p = 0.80. For that reason, we would reject H0 in favor of the alternative Ha: p < 0.80. It appears that the virtual player makes fewer than 80% of his free throws.

For the job satisfaction experiment, however, the P-value of 0.2302 is not convincing evidence against H0: μ = 0. We therefore fail to reject H0. Researchers cannot conclude that job satisfaction differs based on work conditions for the population of assembly-line workers at the company.

In a nutshell, our conclusion in a significance test comes down to:
P-value small → reject H0 → conclude Ha (in context)
P-value large → fail to reject H0 → cannot conclude Ha (in context)

There is no rule for how small a P-value we should require in order to reject H0; it’s a matter of judgment and depends on the specific circumstances. But we can compare the P-value with a fixed value that we regard as decisive, called the significance level. We write it as α, the Greek letter alpha. If we choose α = 0.05, we are requiring that the data give evidence against H0 so strong that it would happen less than 5% of the time just by chance when H0 is true. If we choose α = 0.01, we are insisting on stronger evidence against the null hypothesis, a result that would occur less often than 1 in every 100 times in the long run if H0 is true. When our P-value is less than the chosen α, we say that the result is statistically significant.

Statistically Significant - If the P-value is smaller than α, we say that the data are statistically significant at level α. In that case, we reject the null hypothesis H0 and conclude that there is convincing evidence in favor of the alternative hypothesis Ha.
“Significant” in the statistical sense does not necessarily mean “important.” It means simply “not likely to happen just by chance.” The significance level α makes “not likely” more exact. Significance at level 0.01 is often expressed by the statement, “The results were significant (P < 0.01).” Here P stands for the P-value. The actual P-value is more informative than a statement of significance because it allows us to assess significance at any level we choose. For example, a result with P = 0.03 is significant at the α = 0.05 level but is not significant at the α = 0.01 level. When we use a fixed significance level to draw a conclusion in a statistical test:
P-value < α → reject H0 → conclude Ha (in context)
P-value ≥ α → fail to reject H0 → cannot conclude Ha (in context)

Example: Better Batteries
Statistical Significance

A company has developed a new deluxe AAA battery that is supposed to last longer than its regular AAA battery. However, these new batteries are more expensive to produce, so the company would like to be convinced that they really do last longer. Based on years of experience, the company knows that its regular AAA batteries last for 30 hours of continuous use, on average. The company selects an SRS of 15 new batteries and uses them continuously until they are completely drained. A significance test is performed using the hypotheses
H0: μ = 30 hours
Ha: μ > 30 hours
where μ is the true mean lifetime of the new deluxe AAA batteries. The resulting P-value is 0.0276.

What conclusion would you make for each of the following significance levels? Justify your answer.

AP EXAM TIP The conclusion to a significance test should always include three components: (1) an explicit comparison of the P-value to a stated significance level OR an interpretation of the P-value as a conditional probability, (2) a decision about the null hypothesis: reject or fail to reject H0, and (3) an explanation of what the decision means in context.

In practice, the most commonly used significance level is α = 0.05.
Sometimes it may be preferable to choose α = 0.01 or α = 0.10, for reasons we will discuss shortly. Warning: if you are going to draw a conclusion based on statistical significance, then the significance level α should be stated before the data are produced. Otherwise, a deceptive user of statistics might set an α level after the data have been analyzed in an obvious attempt to manipulate the conclusion. This is just as inappropriate as choosing an alternative hypothesis to be one-sided in a particular direction after looking at the data.

Why choose a significance level at all? The purpose of a significance test is to give a clear statement of the strength of evidence provided by the data against the null hypothesis. The P-value does this. But how small a P-value is convincing evidence against the null hypothesis? This depends mainly on two circumstances:
• How plausible is H0? If H0 represents an assumption that the people you must convince have believed for years, strong evidence (small P-value) will be needed to persuade them.
• What are the consequences of rejecting H0? If rejecting H0 in favor of Ha means making an expensive change of some kind, you need strong evidence that the change will be beneficial.

These criteria are a bit subjective. Different people will insist on different levels of significance. Giving the P-value allows each of us to decide individually if the evidence is sufficiently strong.

Users of statistics have often emphasized standard significance levels such as 10%, 5%, and 1%. For example, courts have tended to accept 5% as a standard in discrimination cases. This emphasis reflects the time when tables of critical values rather than technology dominated statistical practice. The 5% level (α = 0.05) is particularly common (probably due to R. A. Fisher).

There is no practical distinction between the P-values 0.049 and 0.051. However, if we use an α = 0.05 significance level, the former value will lead us to reject H0 while the latter value will lead us to not reject H0.

Beginning users of statistical tests generally find it easier to compare a P-value to a significance level than to interpret the P-value correctly in context. For that reason, we will include stating a significance level as a required part of every significance test. We’ll also ask you to explain what a P-value means in a variety of settings.
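The fixed-level decision rule described in this section can be written as a short function. This is a minimal sketch: the function name `decide` is hypothetical, and the three levels tried for the battery example (0.10, 0.05, 0.01) are the standard levels mentioned in the text, assumed here for illustration.

```python
def decide(p_value, alpha):
    """Apply the fixed-level rule: reject H0 only when the P-value is below alpha."""
    if not 0 <= p_value <= 1:
        raise ValueError("p_value must be between 0 and 1")
    if not 0 < alpha < 1:
        raise ValueError("alpha must be strictly between 0 and 1")
    return "reject H0" if p_value < alpha else "fail to reject H0"


# Battery example from the text: P-value = 0.0276.
battery_p_value = 0.0276
decisions = {alpha: decide(battery_p_value, alpha) for alpha in (0.10, 0.05, 0.01)}
for alpha, decision in decisions.items():
    print(f"alpha = {alpha:.2f}: {decision}")

# The rule treats nearly identical P-values differently at alpha = 0.05.
print(decide(0.049, 0.05), "/", decide(0.051, 0.05))
```

The same P-value leads to different decisions at different levels, which is why α should be fixed before the data are produced.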
9.1.5 Type I and Type II Errors

When we draw a conclusion from a significance test, we hope our conclusion will be correct. But sometimes it will be wrong. There are two types of mistakes we can make. We can reject the null hypothesis when it’s actually true, known as a Type I error, or we can fail to reject a false null hypothesis, which is a Type II error.

Type I Error - If we reject H0 when H0 is true, we have committed a Type I error.

Type II Error - If we fail to reject H0 when H0 is false, we have committed a Type II error.

If H0 is true, our conclusion is correct if we fail to reject H0, but it is a Type I error if we reject H0. If Ha is true, our conclusion is either correct or a Type II error. Only one error is possible at a time.

Example: Perfect Potatoes
Type I and Type II Errors

A potato chip producer and its main supplier agree that each shipment of potatoes must meet certain quality standards. If the producer determines that more than 8% of the potatoes in the shipment have “blemishes,” the truck will be sent away to get another load of potatoes from the supplier. Otherwise, the entire truckload will be used to make potato chips. To make the decision, a supervisor will inspect a random sample of potatoes from the shipment. The producer will then perform a significance test using the hypotheses
H0: p = 0.08
Ha: p > 0.08
where p is the actual proportion of potatoes with blemishes in a given truckload.

Describe a Type I and a Type II error in this setting, and explain the consequences of each.

CHECK YOUR UNDERSTANDING

A company has developed a new deluxe AAA battery that is supposed to last longer than its regular AAA battery. However, these new batteries are more expensive to produce, so the company would like to be convinced that they really do last longer. Based on years of experience, the company knows that its regular AAA batteries last for 30 hours of continuous use, on average. The company selects an SRS of 15 new batteries and uses them continuously until they are completely drained. A significance test is performed using the hypotheses
H0: μ = 30 hours
Ha: μ > 30 hours
where μ is the true mean lifetime of the new deluxe AAA batteries. The resulting P-value is 0.0276.
1. Describe a Type I error in this setting.
2. Describe a Type II error in this setting.
3. Which type of error is more serious in this case? Justify your answer.

Error Probabilities

We can assess the performance of a significance test by looking at the probabilities of the two types of error. That’s because statistical inference is based on asking, “What would happen if I did this many times?” We cannot (without inspecting the whole truckload) guarantee that good shipments of potatoes will never be rejected and bad shipments will never be accepted. But we can think about our chances of making each of these mistakes.

Example: Perfect Potatoes
Type I Error probability

For the truckload of potatoes in the previous example, we were testing
H0: p = 0.08
Ha: p > 0.08
where p is the actual proportion of potatoes with blemishes. Suppose that the potato-chip producer decides to carry out this test based on a random sample of 500 potatoes using a 5% significance level (α = 0.05). A Type I error is to reject H0 when H0 is actually true. If our sample results in a value of p-hat that is much larger than 0.08, we will reject H0. How large would p-hat need to be? The 5% significance level tells us to count results that could happen less than 5% of the time by chance if H0 is true as evidence that H0 is false. Assuming H0: p = 0.08 is true, the sampling distribution of p-hat will have
Shape: Approximately Normal because 500(0.08) = 40 and 500(0.92) = 460 are both at least 10.

The figure below shows the Normal curve that approximates this sampling distribution. The shaded area in the right tail of the figure is 5%. Values of p-hat to the right of the green line will cause us to reject H0 even though H0 is true. This will happen in 5% of all possible samples. That is, the probability of making a Type I error is 0.05.

The probability of a Type I error is the probability of rejecting H0 when it is really true. As the previous example showed, this is exactly the significance level of the test.
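The rejection cutoff in the potato example can be worked out from the Normal approximation described above. The sketch below assumes the usual formula sqrt(p(1 − p)/n) for the standard deviation of p-hat when H0 is true and uses Python's `statistics.NormalDist`; the variable names are illustrative, not from the original.

```python
from math import sqrt
from statistics import NormalDist

p0 = 0.08     # proportion of blemished potatoes claimed by H0
n = 500       # sample size
alpha = 0.05  # significance level

# Large Counts check from the text: both expected counts must be at least 10.
assert n * p0 >= 10 and n * (1 - p0) >= 10

# Approximate sampling distribution of p-hat when H0 is true.
sd_null = sqrt(p0 * (1 - p0) / n)
null_dist = NormalDist(mu=p0, sigma=sd_null)

# Cutoff with area alpha to its right: larger p-hat values lead to rejecting H0.
cutoff = null_dist.inv_cdf(1 - alpha)

# Probability of landing beyond the cutoff when H0 is true (the Type I error rate).
type_i_probability = 1 - null_dist.cdf(cutoff)

print(f"standard deviation of p-hat under H0: {sd_null:.4f}")
print(f"reject H0 when p-hat exceeds about {cutoff:.4f}")
print(f"P(Type I error) = {type_i_probability:.3f}")
```

Under these assumptions the cutoff comes out close to 0.10, and the tail area beyond it equals α by construction.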
What about Type II errors? A significance test makes a Type II error when it fails to reject a null hypothesis that really is false. There are many values of the parameter that satisfy the alternative hypothesis, so we concentrate on one value. A high probability of a Type II error for a particular alternative means that the test is not sensitive enough to usually detect that alternative. In the significance test setting, it is more common to report the probability that a test does reject H0 when an alternative is true. This probability is called the power of the test against that specific alternative. The higher this probability is, the more sensitive the test is.

Power - The power of a test against a specific alternative is the probability that the test will reject H0 at a chosen significance level α when the specified alternative value of the parameter is true.

As the following example illustrates, Type II error and power are closely linked.

Example: Perfect Potatoes
Type II error and power

The potato-chip producer wonders whether the significance test of H0: p = 0.08 versus Ha: p > 0.08 based on a random sample of 500 potatoes has enough power to detect a shipment with, say, 11% blemished potatoes. In this case, a particular Type II error is to fail to reject H0: p = 0.08 when p = 0.11. The figure below shows two sampling distributions of p-hat, one when p = 0.08 and the other when p = 0.11.

Earlier, we decided to reject H0 if our sample yielded a value of p-hat to the right of the green line. That decision was based on using a significance level (Type I error probability) of α = 0.05. Now look at the sampling distribution for p = 0.11. The shaded area represents the probability of correctly rejecting H0 when p = 0.11. That is, the power of this test to detect p = 0.11 is about 0.75. In other words, the potato-chip producer has roughly a 3-in-4 chance of rejecting a truckload with 11% blemished potatoes based on a random sample of 500 potatoes from the shipment.

We would fail to reject H0 if the sample proportion p-hat falls to the left of the green line. The white area shows the probability of failing to reject H0 when H0 is false. This is the probability of a Type II error. The potato-chip producer has about a 1-in-4 chance of failing to send away a shipment with 11% blemished potatoes.
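The power and Type II error probability in this example can be approximated with two Normal curves, one centered at p = 0.08 and one at p = 0.11. This is a sketch under the Normal approximation, not the textbook's own calculation: the function name `power_upper_tail` is hypothetical, and the result lands near, rather than exactly on, the "about 0.75" read from the figure.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail(p0, p_alt, n, alpha):
    """Approximate power of the test H0: p = p0 vs Ha: p > p0 when p = p_alt."""
    # The rejection cutoff is set by the sampling distribution under H0.
    null_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))
    cutoff = null_dist.inv_cdf(1 - alpha)
    # Power is the area to the right of that cutoff under the alternative.
    alt_dist = NormalDist(mu=p_alt, sigma=sqrt(p_alt * (1 - p_alt) / n))
    return 1 - alt_dist.cdf(cutoff)


power = power_upper_tail(p0=0.08, p_alt=0.11, n=500, alpha=0.05)
type_ii_probability = 1 - power  # failing to reject H0 when p = 0.11

print(f"power to detect p = 0.11: {power:.3f}")
print(f"P(Type II error) when p = 0.11: {type_ii_probability:.3f}")
```

Power and the Type II error probability always sum to 1 for a given alternative, because the test either rejects H0 or fails to reject it.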
The potato-chip producer decided that it is important to detect when a shipment contains 11% blemished potatoes. Suppose the company had decided instead that detecting a shipment with 10% blemished potatoes is important. Our power calculations would then have been different. (Can you tell whether the power of the test to detect p = 0.10 would be higher or lower than for p = 0.11?) Remember: the power of a test gives the probability of detecting a specific alternative value of the parameter. The choice of that alternative value is usually made by someone with a vested interest in the situation (like the potato-chip producer).

After reading the example, you might be wondering whether 0.75 is a high power or a low power. That depends on how certain the potato-chip producer wants to be to detect a shipment with 11% blemished potatoes. The power of a test against a specific alternative value of the parameter (like p = 0.11) is a number between 0 and 1. A power close to 0 means the test has almost no chance of detecting that H0 is false. A power near 1 means the test is very likely to reject H0 in favor of Ha when H0 is false.

The significance level of a test is the probability of reaching the wrong conclusion when the null hypothesis is true. The power of a test to detect a specific alternative is the probability of reaching the right conclusion when that alternative is true. We can just as easily describe the test by giving the probability of making a Type II error (sometimes called β), because for any specific alternative, power = 1 − β.

Calculating a Type II error probability or power by hand is possible but unpleasant. It’s better to let technology do the work for you.
9.1.6 Planning Studies: The Power of a Statistical Test

How large a sample should we take when we plan to carry out a significance test? The answer depends on what alternative values of the parameter are important to detect. For instance, the potato-chip producer wants to have a good chance of rejecting H0: p = 0.08 in favor of Ha: p > 0.08 if the true proportion of blemished potatoes in a shipment is p = 0.11. In the last example, we found that the power of the test to detect p = 0.11 using a random sample of size n = 500 and a significance level of α = 0.05 is about 0.75.

Here are the questions we must answer to decide how many observations we need:
1. Significance level. How much protection do we want against a Type I error, that is, getting a significant result from our sample when H0 is actually true? By using α = 0.05, the potato-chip producer has a 5% chance of making a Type I error when H0 is true.
2. Practical importance. How large a difference between the hypothesized parameter value and the actual parameter value is important in practice? The chip producer feels that it’s important to detect a shipment with 11% blemished potatoes, a difference of 0.03 (3 percentage points) from the hypothesized value of p = 0.08.
3. Power. How confident do we want to be that our study will detect a difference of the size we think is important?

Example: Developing Stronger Bones
Planning a study

Can a six-month exercise program increase the total body bone mineral content (TBBMC) of young women? A team of researchers is planning a study to examine this question. The researchers would like to perform a test of
H0: μ = 0
Ha: μ > 0
where μ is the true mean percent change in TBBMC due to the exercise program. To decide how many subjects they should include in their study, researchers begin by answering the three questions above.
1. Significance level. The researchers decide that α = 0.05 gives enough protection against declaring that the exercise program increases bone mineral content when it really doesn’t (a Type I error).
2. Practical importance. A mean increase in TBBMC of 1% would be considered important.
3. Power. The researchers want probability at least 0.9 that a test at the chosen significance level will reject the null hypothesis H0: μ = 0 when the truth is μ = 1.
Our best advice for maximizing the power of a test is to choose as high an α level (Type I error probability) as you are willing to risk and as large a sample size as you can afford.
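To see why this advice works, the potato-chip test (H0: p = 0.08 versus Ha: p > 0.08, alternative p = 0.11) can be rerun at several significance levels and sample sizes. This is a sketch using the Normal approximation; the function name and the particular grid of α values and sample sizes are illustrative assumptions, not values from the text apart from α = 0.05 and n = 500.

```python
from math import sqrt
from statistics import NormalDist


def power_upper_tail(p0, p_alt, n, alpha):
    """Approximate power of the test H0: p = p0 vs Ha: p > p0 when p = p_alt."""
    null_dist = NormalDist(mu=p0, sigma=sqrt(p0 * (1 - p0) / n))
    cutoff = null_dist.inv_cdf(1 - alpha)  # reject H0 when p-hat exceeds this
    alt_dist = NormalDist(mu=p_alt, sigma=sqrt(p_alt * (1 - p_alt) / n))
    return 1 - alt_dist.cdf(cutoff)


alphas = (0.01, 0.05, 0.10)
sample_sizes = (250, 500, 1000)

# Power for every combination of significance level and sample size.
power_table = {
    (alpha, n): power_upper_tail(0.08, 0.11, n, alpha)
    for alpha in alphas
    for n in sample_sizes
}

for (alpha, n), power in sorted(power_table.items()):
    print(f"alpha = {alpha:.2f}, n = {n:4d}: power = {power:.3f}")
```

Reading down the output, power rises as n grows for a fixed α, and rises as α grows for a fixed n, at the cost of a higher Type I error probability.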