9.3.1 Carrying out a Significance Test for μ
A company claimed to have developed a new AAA battery that lasts longer than its regular AAA
batteries. Based on years of experience, the company knows that its regular AAA batteries last for 30 hours of
continuous use, on average. An SRS of 15 new batteries lasted an average of 33.9 hours with a standard
deviation of 9.8 hours. Do these data give convincing evidence that the new batteries last longer on
average? To find out, we perform a test of
H0: μ = 30
Ha: μ > 30
where μ is the true mean lifetime of the new deluxe AAA batteries.
Conditions
Three conditions should be met before performing inference about a population mean: Random, Normal, and
Independent. As previously mentioned, the Normal condition for means is that the population distribution is
Normal or the sample size is large (n ≥ 30).
We often don’t know whether the population distribution is Normal. But if the sample size is large (n ≥ 30), we
can safely carry out a significance test (due to the central limit theorem). If the sample size is small, we should
examine the sample data for any obvious departures from Normality, such as skewness and outliers.
Recall that the t procedures are quite robust against non-Normality of the population except when
outliers or strong skewness are present. For significance tests, “robust” means that the stated P-value is pretty
accurate.
Example – Better Batteries
The figure below shows a dotplot, boxplot, and Normal probability plot of the battery lifetimes for an SRS of
15 batteries.
Check the conditions for carrying out a significance test of the company’s claim about its deluxe AAA
batteries.
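Calculations: Test statistic and P-value
When performing a significance test, we do calculations assuming that the null hypothesis H0 is true. The test
statistic measures how far the sample result diverges from the parameter value specified by H0, in
standardized units. As before,
$$\text{test statistic} = \frac{\text{statistic} - \text{parameter}}{\text{standard deviation of statistic}}$$
For a test of H0: μ = μ0, our statistic is the sample mean x-bar. Its standard deviation is
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
In an ideal world, our test statistic would be
$$z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}$$
Because the population standard deviation σ is usually unknown, we use the sample standard deviation sx in
its place. The resulting test statistic has the standard error of x-bar in the denominator:
$$t = \frac{\bar{x} - \mu_0}{s_x / \sqrt{n}}$$
As mentioned earlier, when the Normal condition is met, this statistic has a t distribution with n − 1 degrees of
freedom.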
Example – Better Batteries
The battery company wants to test H0: μ = 30 versus Ha: μ > 30 based on an SRS of 15 new AAA batteries with
mean lifetime x-bar = 33.9 hours and standard deviation sx = 9.8 hours. The test statistic is
$$t = \frac{\bar{x} - \mu_0}{s_x / \sqrt{n}} = \frac{33.9 - 30}{9.8 / \sqrt{15}} = 1.54$$
The P-value is the probability of getting a result this large or larger in the direction indicated by Ha, that
is, P(t ≥ 1.54). The figure below shows this probability as an area under the t distribution curve with df = 15 − 1
= 14. We can find this P-value using the t-distribution table.
Go to the df = 14 row. The t statistic falls between the values 1.345 and 1.761. If you look at the top of the
corresponding columns in the t-distribution table, you’ll find that the “Upper-tail probability p” is between
0.10 and 0.05. (See the figure below.) Since we are looking for P(t ≥ 1.54), this is the probability we seek. That
is, the P-value for this test is between 0.05 and 0.10.
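The table only brackets the P-value; software can compute it directly. The following is a minimal sketch using only the Python standard library. The helper names t_pdf and t_upper_tail are our own (not from any library), and the tail area is approximated by Simpson's rule applied to the t density, which is one reasonable approach rather than the method a calculator necessarily uses.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) by integrating the density over [0, |t|] with Simpson's rule."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3
    # By symmetry, half of the total area lies on each side of 0.
    return 0.5 - area if t >= 0 else 0.5 + area

# Battery example: H0: mu = 30 versus Ha: mu > 30
n, x_bar, s_x, mu0 = 15, 33.9, 9.8, 30
t_stat = (x_bar - mu0) / (s_x / math.sqrt(n))
p_value = t_upper_tail(t_stat, df=n - 1)

print(f"t = {t_stat:.2f}, df = {n - 1}, P-value = {p_value:.4f}")
```

The computed P-value should land inside the 0.05 to 0.10 range read from the table.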
As you can see, the t-distribution table gives a range of possible P-values for a significance test. We can still
draw a conclusion from the test in much the same way as if we had a single probability. In the case of the new
AAA batteries, for instance, we don’t have enough evidence to reject H0: μ = 30 because the P-value exceeds
our default α = 0.05 significance level. We can’t conclude that the company’s new AAA batteries last longer
than 30 hours, on average.
The t-distribution table has other limitations for finding P-values. It includes probabilities only for t
distributions with degrees of freedom from 1 to 30 and then skips to df = 40, 50, 60, 80, 100, and 1000. (The
bottom row gives probabilities for df = ∞, which corresponds to the standard Normal curve.) In addition, the
t-distribution table shows probabilities only for positive values of t. To find a P-value for a negative value
of t, we use the symmetry of the t distributions. The next example shows how we deal with both of these
issues.
Example – Two-Sided Tests, Negative t Values, and More
Using the t-distribution table wisely
What if you were performing a test of
H0: μ = 5
Ha: μ ≠ 5
based on a sample size of n = 37 and obtained t = −3.17? Since this is a two-sided test, you are interested in
the probability of getting a value of t less than −3.17 or greater than 3.17. The figure below shows the
desired P-value as an area under the t distribution curve with 36 degrees of freedom.
Notice that P(t ≤ −3.17) = P(t ≥ 3.17) due to the symmetric shape of the density curve. Since the
t-distribution table shows only positive t-values, we will focus on t = 3.17.
Since df = 37 − 1 = 36 is not available on the table, use df = 30. You might be tempted to use df = 40, but
doing so would result in a smaller P-value than you are entitled to with df = 36. (In other words, you’d be
cheating!) Move across the df = 30 row, and notice that t = 3.17 falls between 3.030 and 3.385. The corresponding
“Upper-tail probability p” is between 0.0025 and 0.001. (See the figure below.) For this two-sided test, the
corresponding P-value would be between 2(0.001) = 0.002 and 2(0.0025) = 0.005.
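With software there is no need to round the degrees of freedom down to a table row. Here is a hedged sketch in plain Python of the two-sided calculation at the actual df = 36. The helpers t_pdf and t_upper_tail are illustrative names of our own, and the tail area is a Simpson's-rule approximation.

```python
import math

def t_pdf(x, df):
    """Density of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) for t >= 0 using Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3

t_stat, n = -3.17, 37
df = n - 1

# Symmetry: P(T <= -3.17) = P(T >= 3.17), so double one tail.
one_tail = t_upper_tail(abs(t_stat), df)
p_two_sided = 2 * one_tail

print(f"df = {df}, one tail = {one_tail:.4f}, two-sided P-value = {p_two_sided:.4f}")
```

The result should fall inside the 0.002 to 0.005 range obtained from the df = 30 row, toward the lower end because df = 36 has lighter tails than df = 30.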
Given the limitations of using the t-distribution table, our advice is to use technology to find P-values when
carrying out a significance test about a population mean.
Learn: Computing P-values from the t distribution on the calculator
CHECK YOUR UNDERSTANDING
Does the job satisfaction of assembly-line workers differ when their work is machine-paced rather than
self-paced? One study chose 18 subjects at random from a company with over 200 workers who assembled electronic
devices. Half of the workers were assigned at random to each of two groups. Both groups did similar assembly
work, but one group was allowed to pace themselves while the other group used an assembly line that moved at a
fixed pace. After two weeks, all the workers took a test of job satisfaction. Then they switched work setups and took
the test again after two more weeks. The response variable is the difference in satisfaction scores, self-paced minus
machine-paced. The hypotheses are:
H0: μ = 0
Ha: μ ≠ 0
where μ is the mean difference in job satisfaction scores (self-paced − machine-paced) in the population of
assembly-line workers at the company. Data from a random sample of 18 workers gave
and sx = 60.
1. Calculate the test statistic. Show your work.
2. Use the t-distribution table to find the P-value. What conclusion would you draw?
3. Now use your calculator to find the P-value as described in the Technology Corner. Is your result consistent with
the value you obtained in Question 2?
9.3.2 The One-sample t Test
Example – Healthy Streams
Performing a significance test about μ
The level of dissolved oxygen (DO) in a stream or river is an important indicator of the water’s ability to
support aquatic life. A researcher measures the DO level at 15 randomly chosen locations along a stream. Here
are the results in milligrams per liter (mg/l):
A dissolved oxygen level below 5 mg/l puts aquatic life at risk.
(a) Can we conclude that aquatic life in this stream is at risk? Carry out a test at the α = 0.05 significance
level to help you answer this question.
(b) Given your conclusion in part (a), which kind of mistake (a Type I error or a Type II error) could you
have made? Explain what this mistake would mean in context.
Learn: One-sample t test on the calculator
AP EXAM TIP
Remember: if you just give calculator results with no work, and one or more values are wrong, you probably
won’t get any credit for the “Do” step. We recommend doing the calculation with the appropriate formula and
then checking with your calculator. If you opt for the calculator-only method, name the procedure (t test) and
report the test statistic (t = −0.94), degrees of freedom (df = 14), and P-value (0.1809).
CHECK YOUR UNDERSTANDING
A college professor suspects that students at his school are getting less than 8 hours of sleep a night, on average. To
test his belief, the professor asks a random sample of 28 students, “How much sleep did you get last night?” Here are
the data (in hours):
Do these data provide convincing evidence in support of the professor’s suspicion? Carry out a significance test at
the α = 0.05 level to help answer this question.
9.3.3 Two-Sided Tests and Confidence Intervals
Example – Juicy Pineapple
A two-sided test
At the Hawaii Pineapple Company, managers are interested in the sizes of the pineapples grown in the
company’s fields. Last year, the mean weight of the pineapples harvested from one large field was 31
ounces. A new irrigation system was installed in this field after the growing season. Managers wonder
whether this change will affect the mean weight of future pineapples grown in the field. To find out, they
select and weigh a random sample of 50 pineapples from this year’s crop. The Minitab output below
summarizes the data.
(a) Determine whether there are any outliers. Show your work.
(b) Do these data suggest that the mean weight of pineapples produced in the field has changed this
year? Give appropriate statistical evidence to support your answer.
(c) Can we conclude that the new irrigation system caused a change in the mean weight of pineapples
produced? Explain your answer.
The significance test in the previous example concludes that the mean weight µ of the pineapples grown in the field
this year differs from last year’s 31 ounces. Unfortunately, the test doesn’t give us an idea of what the actual value
of µ is. For that, we need a confidence interval.
Example – Juicy Pineapple
Confidence intervals give more information
Minitab output for a significance test and confidence interval based on the pineapple data is shown
below. The test statistic and P-value match what we got earlier (up to rounding).
The 95% confidence interval for the mean weight of all the pineapples grown in the field this year is 31.255 to
32.616 ounces. We are 95% confident that this interval captures the true mean weight μ of this year’s
pineapple crop.
As with proportions, there is a link between a two-sided test at significance level α and a 100(1 − α)%
confidence interval for a population mean μ. For the pineapples, the two-sided test at α = 0.05 rejects H0: μ =
31 in favor of Ha: μ ≠ 31. The corresponding 95% confidence interval does not include 31 as a plausible value of
the parameter μ. In other words, the test and interval lead to the same conclusion about H0. But the
confidence interval provides much more information: a set of plausible values for the population mean.
The connection between two-sided tests and confidence intervals is even stronger for means than it was for
proportions. That’s because both inference methods for means use the standard error of x-bar in the
calculations.
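The link between the interval and the test can be written as a one-line rule: a two-sided test at level α rejects H0: μ = μ0 exactly when μ0 lies outside the 100(1 − α)% confidence interval. The sketch below applies that rule to the pineapple interval quoted above. The function name two_sided_test_rejects is a hypothetical helper, and the second null value, 32, is an illustrative choice that does not come from the example.

```python
def two_sided_test_rejects(mu0, ci_low, ci_high):
    """A two-sided test at level alpha rejects H0: mu = mu0 exactly when
    mu0 falls outside the matching 100(1 - alpha)% confidence interval."""
    return not (ci_low <= mu0 <= ci_high)

# 95% confidence interval for this year's mean pineapple weight (ounces)
ci_low, ci_high = 31.255, 32.616

# The interval is centered at the sample mean, so its midpoint and
# half-width recover the point estimate and the margin of error.
point_estimate = (ci_low + ci_high) / 2
margin_of_error = (ci_high - ci_low) / 2

rejects_31 = two_sided_test_rejects(31, ci_low, ci_high)  # last year's mean
rejects_32 = two_sided_test_rejects(32, ci_low, ci_high)  # illustrative value

print(f"estimate = {point_estimate:.4f}, margin = {margin_of_error:.4f}")
print(f"reject mu = 31? {rejects_31}; reject mu = 32? {rejects_32}")
```

A value such as 32 ounces sits inside the interval, so the same data would not reject it at α = 0.05.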
CHECK YOUR UNDERSTANDING
The health director of a large company is concerned about the effects of stress on the company’s middle-aged male
employees. According to the National Center for Health Statistics, the mean systolic blood pressure for males 35 to 44
years of age is 128. The health director examines the medical records of a random sample of 72 male employees in
this age group. The Minitab output below displays the results of a significance test and a confidence interval.
1. Do the results of the significance test allow us to conclude that the mean blood pressure for all the company’s
middle-aged male employees differs from the national average? Justify your answer.
2. Interpret the 95% confidence interval in context. Explain how the confidence interval leads to the same conclusion
as in Question 1.
9.3.4 Inference for Means: Paired Data
Paired data: Study designs that involve making two observations on the same individual, or one observation
on each of two similar individuals, result in paired data.
Paired t procedures: When paired data result from measuring the same quantitative variable twice, we can
make comparisons by analyzing the differences in each pair. If the conditions for inference are met, we can
use one-sample t procedures to perform inference about the mean difference μd. These methods are
sometimes called paired t procedures.
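In code, a paired t procedure is just a one-sample t calculation applied to the list of within-pair differences. The sketch below uses made-up scores for six subjects purely to show the mechanics. The numbers and variable names are hypothetical and are not the data from any example in these notes.

```python
import math
import statistics

# Hypothetical paired measurements on the same six subjects (illustration only)
condition_a = [4, 6, 3, 5, 7, 2]
condition_b = [9, 8, 3, 11, 10, 6]

# Step 1: reduce each pair to a single difference (b minus a).
diffs = [b - a for a, b in zip(condition_a, condition_b)]

# Step 2: treat the differences as one sample and test H0: mu_d = 0.
n = len(diffs)
mean_d = statistics.mean(diffs)
s_d = statistics.stdev(diffs)  # sample standard deviation (divides by n - 1)
t_stat = (mean_d - 0) / (s_d / math.sqrt(n))

print(f"differences = {diffs}")
print(f"mean = {mean_d:.3f}, s = {s_d:.3f}, t = {t_stat:.2f}, df = {n - 1}")
```

The degrees of freedom are the number of pairs minus 1, not the total number of measurements minus 1.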
Example – Is Caffeine Dependence Real?
Paired data and one-sample t procedures
Researchers designed an experiment to study the effects of caffeine withdrawal. They recruited 11 volunteers
who were diagnosed as being caffeine dependent to serve as subjects. Each subject was barred from
coffee, colas, and other substances with caffeine for the duration of the experiment. During one two-day
period, subjects took capsules containing their normal caffeine intake. During another two-day period, they
took placebo capsules. The order in which subjects took caffeine and the placebo was randomized. At the end
of each two-day period, a test for depression was given to all 11 subjects. Researchers wanted to know
whether being deprived of caffeine would lead to an increase in depression.
The table below contains data on the subjects’ scores on a depression test. Higher scores show more
symptoms of depression.
(a) Why did researchers randomly assign the order in which
subjects received placebo and caffeine?
(b) Carry out a test to investigate the researchers’ question.
9.3.5 Using Tests Wisely
Significance tests are widely used in reporting the results of research in many fields. New drugs require significant
evidence of effectiveness and safety. Courts ask about statistical significance in hearing discrimination
cases. Marketers want to know whether a new ad campaign significantly outperforms the old one, and medical
researchers want to know whether a new therapy performs significantly better. In all these uses, statistical
significance is valued because it points to an effect that is unlikely to occur simply by chance.
Carrying out a significance test is often quite simple, especially if you use a calculator or computer. Using tests wisely
is not so simple. Here are some points to keep in mind when using or interpreting significance tests.
Statistical Significance and Practical Importance When a null hypothesis (“no effect” or “no difference”) can be
rejected at the usual levels (α = 0.05 or α = 0.01), there is good evidence of a difference. But that difference may be
very small. When large samples are available, even tiny deviations from the null hypothesis will be significant.
Example – Wound Healing Time
Significant doesn’t mean important
Suppose we’re testing a new antibacterial cream, “Formulation NS,” on a small cut made on the inner
forearm. We know from previous research that with no medication, the mean healing time (defined as the
time for the scab to fall off) is 7.6 days with a standard deviation of 1.4 days. The claim we want to test here is
that Formulation NS speeds healing. We will use a 5% significance level.
Procedure: We cut a random sample of 25 college students and apply Formulation NS to the wounds. The
mean healing time for these subjects is x-bar = 7.1
days and the standard deviation is sx = 1.4 days.
Discussion: We want to test a claim about the mean healing time μ in the population of college students
whose cuts are treated with Formulation NS. Our hypotheses are
H0: μ = 7.6
Ha: μ < 7.6
An examination of the data reveals no outliers or strong skewness, so the conditions for performing a
one-sample t test are met. We carry out the test and find that t = −1.79 and P-value = 0.043. Since 0.043 is less
than α = 0.05, we reject H0 and conclude that Formulation NS’s healing effect is statistically
significant. However, this result is not practically important. Having your scab fall off half a day sooner is no big
deal.
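The reason large samples make tiny effects significant is visible in the formula: for a fixed difference and standard deviation, the t statistic grows in proportion to √n. The sketch below first reproduces the healing-time statistic, taking the sample mean as 7.1 days (a value inferred from the reported t = −1.79 and the "half a day sooner" remark). It then evaluates a purely hypothetical study with a much smaller 0.1-day improvement and 2500 subjects.

```python
import math

def one_sample_t(x_bar, mu0, s, n):
    """One-sample t statistic: (x_bar - mu0) / (s / sqrt(n))."""
    return (x_bar - mu0) / (s / math.sqrt(n))

mu0, s = 7.6, 1.4

# Healing-time example: half a day faster, n = 25 (x_bar = 7.1 is inferred)
t_example = one_sample_t(7.1, mu0, s, 25)

# Hypothetical: only a tenth of a day faster, but n = 2500
t_hypothetical = one_sample_t(7.5, mu0, s, 2500)

print(f"0.5-day effect, n = 25:   t = {t_example:.2f}")
print(f"0.1-day effect, n = 2500: t = {t_hypothetical:.2f}")
```

The smaller effect produces the more extreme t statistic, which is why the size of the effect should be reported alongside the P-value.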
Remember the wise saying: Statistical significance is not the same thing as practical importance. The remedy for
attaching too much importance to statistical significance is to pay attention to the actual data as well as to the
P-value. Plot your data and examine them carefully. Are there outliers or other deviations from a consistent pattern? A
few outlying observations can produce highly significant results if you blindly apply common significance
tests. Outliers can also destroy the significance of otherwise-convincing data.
The foolish user of statistics who feeds the data to a calculator or computer without exploratory analysis will often be
embarrassed. Is the difference you are seeking visible in your plots? If not, ask yourself whether the difference is
large enough to be practically important. Give a confidence interval for the parameter in which you are interested. A
confidence interval actually estimates the size of the difference rather than simply asking if it is too large to
reasonably occur by chance alone. Confidence intervals are not used as often as they should be, whereas significance
tests are perhaps overused.
Don’t Ignore Lack of Significance There is a tendency to infer that there is no difference whenever a P-value fails
to attain the usual 5% standard. A provocative editorial in the British Medical Journal entitled “Absence of Evidence Is
Not Evidence of Absence” deals with this issue.22 Here is one of the examples they cite:
Example – Reducing HIV Transmission
Interpreting lack of significance
In an experiment to compare methods for reducing transmission of HIV, subjects were randomly assigned to a
treatment group and a control group. Result: the treatment group and the control group had the same rate of
HIV infection. Researchers described this as an “incident rate ratio” of 1.00. A ratio above 1.00 would mean
that there was a greater rate of HIV infection in the treatment group, while a ratio below 1.00 would indicate a
greater rate of HIV infection in the control group. The 95% confidence interval for the incident rate ratio was
reported as 0.63 to 1.58. Saying that the treatment has no effect on HIV infection is misleading. The
confidence interval for the incident rate ratio indicates that the treatment may be able to achieve a 37%
decrease in infection. It might also produce a 58% increase in infection. Clearly, more data are needed to
distinguish between these possibilities.
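The 37% and 58% figures come straight from the interval endpoints: a rate ratio r corresponds to a change of (r − 1) × 100 percent relative to the control group. A short sketch of that arithmetic follows. The function name percent_change is our own.

```python
def percent_change(rate_ratio):
    """Convert a rate ratio (treatment / control) to a percent change."""
    return (rate_ratio - 1.0) * 100.0

# Reported 95% confidence interval for the rate ratio
ci_low, ci_high = 0.63, 1.58

best_case = percent_change(ci_low)    # largest plausible decrease
worst_case = percent_change(ci_high)  # largest plausible increase

# A ratio of 1.00 (no effect) is inside the interval, but so are large effects.
no_effect_is_plausible = ci_low <= 1.00 <= ci_high

print(f"plausible change in infection rate: {best_case:.0f}% to +{worst_case:.0f}%")
print(f"ratio of 1.00 inside the interval: {no_effect_is_plausible}")
```

The width of this range, rather than the fact that it contains 1.00, is what shows the study to be inconclusive.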
The situation can be worse. Research in some fields has rarely been published unless significance at the 5% level is
attained. For instance, a survey of four journals published by the American Psychological Association showed that of
294 articles using statistical tests, only 8 reported results that did not attain the 5% significance level.24 That’s too
bad, because we can learn a great deal from studies that fail to find convincing evidence.
In some areas of research, small differences that are detectable only with large sample sizes can be of great practical
significance. Data accumulated from a large number of patients taking a new drug may be needed before we can
conclude that there are life-threatening consequences for a small number of people. When planning a study, verify
that the test you plan to use has a high probability (power) of detecting a difference of the size you hope to find.
Statistical Inference Is Not Valid for All Sets of Data Badly designed surveys or experiments often produce
invalid results. Formal statistical inference cannot correct basic flaws in the design. Each test is valid only in certain
circumstances, with properly produced data being particularly important.
Example – Does Music Increase Worker Productivity?
When inference isn’t valid
You wonder whether background music would improve the productivity of the staff who process mail orders
in your business. After discussing the idea with the workers, you add music and find a significant increase. You
should not be impressed. In fact, almost any change in the work environment together with knowledge that a
study is under way will produce a short-term productivity increase. This is the Hawthorne effect, named after
the Western Electric manufacturing plant where it was first noted.
The significance test correctly informs you that an increase has occurred that is larger than would often arise
by chance alone. It does not tell you what other than chance caused the increase. The most plausible
explanation is that workers change their behavior when they know they are being studied. Your experiment
was uncontrolled, so the significant result cannot be interpreted. A randomized comparative experiment
would isolate the actual effect of background music and so make significance meaningful.
Hawthorne effect: The fact that almost any change in the work environment together with knowledge that a
study is under way will produce a short-term productivity increase.
Significance tests and confidence intervals are based on the laws of probability. Random sampling and random
assignment ensure that these laws apply. Always ask how the data were produced. Don’t be too impressed by
P-values on a printout until you are confident that the data deserve a formal analysis.
Beware of Multiple Analyses Statistical significance ought to mean that you have found a difference that you were
looking for. The reasoning behind statistical significance works well if you decide what difference you are
seeking, design a study to search for it, and use a significance test to weigh the evidence you get. In other
settings, significance may have little meaning.
Example – Cell Phones and Brain Cancer
Don’t search for significance
Might the radiation from cell phones be harmful to users? Many studies have found little or no connection
between using cell phones and various illnesses. Here is part of a news account of one study:
A hospital study that compared brain cancer patients and a similar group without brain cancer found
no statistically significant association between cell phone use and a group of brain cancers known as
gliomas. But when 20 types of glioma were considered separately, an association was found between
phone use and one rare form. Puzzlingly, however, this risk appeared to decrease rather than increase
with greater mobile phone use.
Think for a moment. Suppose that the 20 null hypotheses for these 20 significance tests are all true. Then each
test has a 5% chance of being significant at the 5% level. That’s what α = 0.05 means: results this extreme
occur only 5% of the time just by chance when the null hypothesis is true. We expect about 1 of 20 tests to
give a significant result just by chance. Running one test and reaching the α = 0.05 level is reasonably good
evidence that you have found something; running 20 tests and reaching that level only once is not.
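The "about 1 of 20" figure is the expected count α × 20. If we additionally assume the 20 tests are independent (an assumption made here only for illustration, since tests on related cancer subtypes need not be), the chance that at least one comes out significant is 1 − (1 − α)^20. A small sketch:

```python
alpha = 0.05
num_tests = 20

# Expected number of false positives when every null hypothesis is true
expected_false_positives = alpha * num_tests

# Chance of at least one false positive, ASSUMING the tests are independent
prob_at_least_one = 1 - (1 - alpha) ** num_tests

print(f"expected false positives: {expected_false_positives:.1f}")
print(f"P(at least one significant result): {prob_at_least_one:.3f}")
```

Under these assumptions, a lone significant result among 20 tests is more likely than not to appear even when nothing is going on.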