1.3.1 Measuring Center: The Mean
Mean - The arithmetic average. To find the mean x̄ (pronounced "x-bar") of a set of observations, add their values and divide by the number of observations. If the n observations are x_1, x_2, …, x_n, their mean is:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
Or, in more compact notation,
$$\bar{x} = \frac{\sum x_i}{n}$$
Actually, the notation x̄ refers to the mean of a sample. Most of the time, the data we'll encounter can be thought of as a sample from some larger population. When we need to refer to a population mean, we'll use the symbol μ (Greek letter mu, pronounced "mew"). If you have the entire population of data available, then you calculate μ in just the way you'd expect: add the values of all the observations, and divide by the number of observations.
Example – Travel Times to Work in North Carolina
Calculating the mean
Below are data on travel times of 15 North Carolina residents.
1) Find the mean travel time for all 15 workers.
2) Calculate the mean again, this time excluding the person who reported a 60-minute travel time to work. What do you notice?
The previous example illustrates an important weakness of the mean as a measure of center: the mean is sensitive to the influence of extreme observations. These may be outliers, but a skewed distribution that has no outliers will also pull the mean toward its long tail. Because the mean cannot resist the influence of extreme observations, we say that it is not a resistant measure of center.
Resistant Measure - A statistic that is not affected very much by extreme observations.
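As a small illustration of the definition and of the mean's sensitivity, the Python sketch below computes the mean of the 15 North Carolina travel times (values taken from the ordered list given in Section 1.3.2) with and without the 60-minute observation. The function name `mean` and the variable names are illustrative choices, not part of the original notes.

```python
def mean(values):
    """Arithmetic mean: add the values and divide by how many there are."""
    return sum(values) / len(values)

# Travel times (minutes) of the 15 North Carolina workers
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

mean_all = mean(nc_times)
# Drop the single 60-minute observation and recompute
mean_without_60 = mean([t for t in nc_times if t != 60])

print(f"mean of all 15 times: {mean_all:.2f} minutes")
print(f"mean without the 60:  {mean_without_60:.2f} minutes")
```

Removing just one of the 15 values lowers the mean by more than 2.5 minutes, which is the sensitivity described above.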
1.3.2 Measuring Center: The Median
Median - The median M is the midpoint of a distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations n is odd, the median M is the center observation in the ordered list.
3. If the number of observations n is even, the median M is the average of the two center observations in the ordered list.
Example – Travel Times to Work in North Carolina
Finding the median when n is odd
What is the median travel time for our 15 North Carolina workers? Here are the data arranged in order:
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
The count of observations n = 15 is odd. The 8th value, 20, is the center observation in the ordered list, with 7 observations to its left and 7 to its right. This is the median, M = 20 minutes.
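The three-step procedure for the median can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation; the function name `median` is an assumption made for this example, and the short even-length list is made up to exercise step 3.

```python
def median(values):
    """Median following the three steps: sort, then take the center
    observation (n odd) or average the two center observations (n even)."""
    ordered = sorted(values)          # step 1: arrange in order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # step 2: n is odd
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # step 3: n is even

# Travel times (minutes) of the 15 North Carolina workers
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
print(median(nc_times))       # 20

# A made-up even-length list to show the averaging case
print(median([1, 2, 3, 4]))   # 2.5
```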
Example – Stuck in Traffic
Finding the median when n is even
People say that it takes a long time to get to work in New York State due to the heavy traffic near big cities. What do the data say? Here are the travel times in minutes of 20 randomly chosen New York workers:
10 30 5 25 40 20 10 15 30 20 15 20 85 15 65 15 60 60 40 45
1. Make a stemplot of the data. Be sure to include a key.
2. Find and interpret the median.
1.3.3 Comparing the Mean and the Median
Our discussion of travel times to work in North Carolina illustrates an important difference between the mean and the median. The median travel time (the midpoint of the distribution) is 20 minutes. The mean travel time is higher, 22.5 minutes. The mean is pulled toward the right tail of this right-skewed distribution. The median, unlike the mean, is resistant. If the longest travel time were 600 minutes rather than 60 minutes, the mean would increase to more than 58 minutes but the median would not change at all. The outlier just counts as one observation above the center, no matter how far above the center it lies. The mean uses the actual value of each observation and so will chase a single large observation upward.
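The 60-versus-600 comparison can be checked with a short Python sketch using the standard library's `statistics` module. The variable names are illustrative, and the data are the 15 North Carolina travel times listed in Section 1.3.2.

```python
from statistics import mean, median

# Travel times (minutes) of the 15 North Carolina workers, in order
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

# Same data, but with the longest time changed from 60 to 600 minutes
nc_extreme = nc_times[:-1] + [600]

for label, data in [("original", nc_times), ("with 600", nc_extreme)]:
    print(f"{label}: mean = {mean(data):.1f}, median = {median(data)}")
```

The median stays at 20 minutes in both cases, while the mean moves from about 22.5 to more than 58 minutes.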
The mean and median of a roughly symmetric distribution are close together. If the distribution is exactly symmetric, the mean and median are exactly the same. In a skewed distribution, the mean is usually farther out in the long tail than is the median.
[Figure: a left-skewed distribution and a right-skewed distribution]
Check Your Understanding
Questions 1 through 4 refer to the following setting. Here, once again, is the stemplot of travel times to work for 20 randomly selected New Yorkers. Earlier, we found that the median was 22.5 minutes.
1. Based only on the stemplot, would you expect the mean travel time to be less than, about the same as, or larger than the median? Why?
2. Use your calculator to find the mean travel time. Was your answer to Question 1 correct?
3. Interpret your result from Question 2 in context without using the words "mean" or "average."
4. Would the mean or the median be a more appropriate summary of the center of this distribution of drive times? Justify your answer.
1.3.4 Measuring Spread: The Interquartile Range (IQR)
A useful numerical description of a distribution requires both a measure of center and a measure of spread.
How to Calculate Quartiles (Q1 | M | Q3)
1. Arrange the observations in increasing order and locate the median M in the ordered list of observations.
2. The first quartile Q1 is the median of the observations whose position in the ordered list is to the left of the median.
3. The third quartile Q3 is the median of the observations whose position in the ordered list is to the right of the median.
Interquartile Range – IQR = Q3 − Q1
Example – Travel Times to Work in North Carolina
Calculating quartiles
Our North Carolina sample of 15 workers' travel times, arranged in increasing order, is
5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
There is an odd number of observations, so the median is the middle one, the 8th value (20) in the list. The first quartile is the median of the 7 observations to the left of the median. This is the 4th of these 7 observations, so Q1 = 10 minutes. The third quartile is the median of the 7 observations to the right of the median, Q3 = 30 minutes.
So the spread of the middle 50% of the travel times is IQR = Q3 − Q1 = 30 − 10 = 20 minutes. Be sure to leave out the overall median M when you locate the quartiles.
The quartiles and the interquartile range are resistant because they are not affected by a few extreme observations.
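The quartile procedure above can be sketched in Python. Note that this follows the rule used in these notes (leave out the overall median when n is odd), which can differ from the quantile methods built into software libraries. The function name `quartiles` is an assumption made for this example.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) using the 'median of each half' rule,
    leaving out the overall median when the count is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                                   # left of the median
    upper = ordered[half + 1:] if n % 2 else ordered[half:]  # right of the median
    return median(lower), median(ordered), median(upper)

# Travel times (minutes) of the 15 North Carolina workers
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

q1, m, q3 = quartiles(nc_times)
iqr = q3 - q1
print(f"Q1 = {q1}, M = {m}, Q3 = {q3}, IQR = {iqr}")
```

For the North Carolina data this reproduces Q1 = 10, M = 20, Q3 = 30, and IQR = 20 minutes.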
Example – Stuck in Traffic Again
Finding and interpreting the IQR
Find and interpret the interquartile range (IQR).
1.3.5 Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
1.5 × IQR Rule – Call an observation an outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.
Example – Travel Times to Work in New York
Identifying outliers using the 1.5 × IQR rule
Identify any outliers in the data from the stemplot.
Q1 = 15 minutes
Q3 = 42.5 minutes
IQR = 27.5 minutes
Example – Travel Times to Work in North Carolina
Identifying outliers
Determine if the travel time of 60 minutes in the sample of 15 North Carolina workers is an outlier.
Q1 = 10 minutes
Q3 = 30 minutes
IQR = 20 minutes
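One way to apply the 1.5 × IQR rule is sketched below in Python, using the quartiles stated in the two examples and the travel-time data listed earlier in these notes. The function name `find_outliers` is hypothetical. "More than 1.5 × IQR" is read here as a strict inequality, so a value sitting exactly on a cutoff is not flagged.

```python
def find_outliers(values, q1, q3):
    """Return the values lying more than 1.5 * IQR below Q1 or above Q3,
    together with the two cutoff values."""
    iqr = q3 - q1
    low_cutoff = q1 - 1.5 * iqr
    high_cutoff = q3 + 1.5 * iqr
    flagged = [x for x in values if x < low_cutoff or x > high_cutoff]
    return flagged, (low_cutoff, high_cutoff)

# New York travel times (minutes), with Q1 = 15 and Q3 = 42.5
ny_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
            15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
ny_outliers, ny_cutoffs = find_outliers(ny_times, q1=15, q3=42.5)

# North Carolina travel times (minutes), with Q1 = 10 and Q3 = 30
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]
nc_outliers, nc_cutoffs = find_outliers(nc_times, q1=10, q3=30)

print("New York:", ny_outliers, ny_cutoffs)
print("North Carolina:", nc_outliers, nc_cutoffs)
```

With these inputs the upper cutoff for New York is 83.75 minutes, so only the 85-minute time is flagged. For North Carolina the upper cutoff is exactly 60 minutes, so the 60-minute time is not more than 1.5 × IQR above Q3 and is not flagged.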
1.3.6 The Five-Number Summary and Boxplots
Five-Number Summary – Consists of the smallest observation, the first quartile, the median, the third quartile, and the largest observation, written in order from smallest to largest. In symbols, the five-number summary is
Minimum  Q1  M  Q3  Maximum
These five numbers divide each distribution roughly into quarters. About 25% of the data values fall between the minimum and Q1, about 25% are between Q1 and the median, about 25% are between the median and Q3, and about 25% are between Q3 and the maximum. The five-number summary of a distribution leads to a new graph, the boxplot (also known as a box-and-whisker plot).
How to Make a Boxplot
1. A central box is drawn from the first quartile (Q1) to the third quartile (Q3).
2. A line in the box marks the median.
3. Lines (called whiskers) extend from the box out to the smallest and largest observations that are not outliers.
Example – Home Run King
Making a boxplot
Barry Bonds set the major league record by hitting 73 home runs in a single season in 2001. On August 7, 2007, Bonds hit his 756th career home run, which broke Hank Aaron's longstanding record of 755. By the end of the 2007 season when Bonds retired, he had increased the total to 762. Here are data on the number of home runs that Bonds hit in each of his 21 complete seasons:
16 25 24 19 33 25 34 46 37 33 42 40 37 34 49 73 46 45 45 26 28
Make a boxplot for the data above. The initial steps have been done to save you time.
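The numbers needed to draw the boxplot can be computed with the Python sketch below. It is one possible way to organize the calculation, not the worksheet's own solution: the helper names `five_number_summary` and `whisker_ends` are assumptions, and the quartiles follow the "median of each half" rule from Section 1.3.4.

```python
from statistics import median

def five_number_summary(values):
    """Return (minimum, Q1, M, Q3, maximum); the overall median is
    left out of both halves when the count is odd."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

def whisker_ends(values):
    """Smallest and largest observations that are not outliers
    under the 1.5 * IQR rule."""
    _, q1, _, q3, _ = five_number_summary(values)
    iqr = q3 - q1
    inside = [x for x in values if q1 - 1.5 * iqr <= x <= q3 + 1.5 * iqr]
    return min(inside), max(inside)

# Home runs hit by Barry Bonds in each of his 21 complete seasons
bonds = [16, 25, 24, 19, 33, 25, 34, 46, 37, 33, 42,
         40, 37, 34, 49, 73, 46, 45, 45, 26, 28]

summary = five_number_summary(bonds)
whiskers = whisker_ends(bonds)
print("five-number summary:", summary)
print("whiskers end at:", whiskers)
```

Under this quartile rule the box runs from 25.5 to 45 with the median line at 34. The upper cutoff is 45 + 1.5 × 19.5 = 74.25, so the 73-home-run season is not flagged and the whiskers reach 16 and 73.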
Check Your Understanding
The 2009 roster of the Dallas Cowboys professional football team included 10 offensive linemen. Their weights (in pounds) were
338 318 353 313 318 326 307 317 311 311
1. Find the five-number summary for these data by hand. Show your work.
2. Calculate the IQR. Interpret this value in context.
3. Determine whether there are any outliers using the 1.5 × IQR rule.
4. Draw a boxplot of the data.
1.3.7 Measuring Spread: The Standard Deviation
The five-number summary is not the most common numerical description of a distribution. That distinction belongs to the combination of the mean to measure center and the standard deviation to measure spread. The standard deviation and its close relative, the variance, measure spread by looking at how far the observations are from their mean. Let's explore this idea using a simple set of data.
Example – How Many Pets?
Investigating spread around the mean
The data below give the number of pets owned by 9 children.
1 3 4 4 4 5 7 8 9
The mean number of pets is 5. Let's look at where the observations in the data set are relative to the mean.
The figure above displays the data in a dotplot, with the mean clearly marked. The data value 1 is 4 units below the mean. We say that its deviation from the mean is −4. What about the data value 7? Its deviation is 7 − 5 = 2 (it is 2 units above the mean). The arrows in the figure mark these two deviations from the mean. The deviations show how much the data vary about their mean. They are the starting point for calculating the variance and standard deviation.
The table to the left shows the deviation from the mean (x_i − x̄) for each value in the data set. Sum the deviations from the mean. You should get 0, because the mean is the balance point of the distribution. Since the sum of the deviations from the mean will be 0 for any set of data, we need another way to calculate spread around the mean. How can we fix the problem of the positive and negative deviations canceling out? We could take the absolute value of each deviation. Or we could square the deviations. For mathematical reasons beyond the scope of this book, statisticians choose to square rather than to use absolute values.
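We have added a column to the table that shows the square of each deviation, (x_i − x̄)^2. Add up the squared deviations. Did you get 52? Now we compute the average squared deviation, sort of. Instead of dividing by the number of observations n, we divide by n − 1:
$$\text{variance} = \frac{16 + 4 + 1 + 1 + 1 + 0 + 4 + 9 + 16}{9 - 1} = \frac{52}{8} = 6.5$$
The value 6.5 is called the variance.
Variance - The average squared distance of the observations in a data set from their mean. In symbols,
$$s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
Because we squared all the deviations, our units are in "squared pets." That's no good. We'll take the square root to get back to the correct units: pets. The resulting value is the standard deviation:
$$s_x = \sqrt{6.5} \approx 2.55$$
This 2.55 is roughly the average distance of the values in the data set from the mean.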
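The pets calculation can be retraced step by step in Python. This is a sketch of the arithmetic described above, with variable names chosen for illustration.

```python
from math import sqrt

# Number of pets owned by the 9 children
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

n = len(pets)
x_bar = sum(pets) / n                        # mean: 5.0

deviations = [x - x_bar for x in pets]       # each value minus the mean
squared = [d ** 2 for d in deviations]       # squared deviations

variance = sum(squared) / (n - 1)            # divide by n - 1, not n
std_dev = sqrt(variance)                     # back to the original units

print("sum of deviations:", sum(deviations))
print("sum of squared deviations:", sum(squared))
print("variance:", variance)
print(f"standard deviation: {std_dev:.2f}")
```

The deviations sum to 0, the squared deviations sum to 52, the variance is 6.5, and the standard deviation is about 2.55 pets, matching the values in the text.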
Standard Deviation - The standard deviation s_x measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root. This average squared distance is called the variance. In symbols, the variance s_x^2 is given by
$$s_x^2 = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$$
How to Find the Standard Deviation
1. Find the distance of each observation from the mean and square each of these distances.
2. Average the squared distances by dividing their sum by n − 1.
3. The standard deviation s_x is the square root of this average squared distance:
$$s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$$
Many calculators report two standard deviations, giving you a choice of dividing by n or by n − 1. The former is usually labeled σ_x, the symbol for the standard deviation of a population. If your data set consists of the entire population, then it's appropriate to use σ_x. More often, the data we're examining come from a sample. In that case, we should use s_x.
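Software makes the same distinction. As an illustration, Python's standard `statistics` module provides `stdev` (divides by n − 1, corresponding to s_x) and `pstdev` (divides by n, corresponding to σ_x). The sketch below applies both to the pets data.

```python
from statistics import pstdev, stdev

# Number of pets owned by the 9 children
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]

s_x = stdev(pets)        # sample standard deviation: divides by n - 1
sigma_x = pstdev(pets)   # population standard deviation: divides by n

print(f"s_x     = {s_x:.2f}")
print(f"sigma_x = {sigma_x:.2f}")
```

Dividing by n gives a slightly smaller result (about 2.40) than dividing by n − 1 (about 2.55) for the same data.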
More important than the details of calculating s_x are the properties that determine the usefulness of the standard deviation:
• s_x measures spread about the mean and should be used only when the mean is chosen as the measure of center.
• s_x is always greater than or equal to 0. s_x = 0 only when there is no variability. This happens only when all observations have the same value. Otherwise, s_x > 0. As the observations become more spread out about their mean, s_x gets larger.
• s_x has the same units of measurement as the original observations. For example, if you measure metabolic rates in calories, both the mean x̄ and the standard deviation s_x are also in calories. This is one reason to prefer s_x to the variance s_x^2, which is in squared calories.
• Like the mean x̄, s_x is not resistant. A few outliers can make s_x very large. The use of squared deviations makes s_x even more sensitive than x̄ to a few extreme observations.
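The last property can be illustrated with a short Python sketch that reuses the North Carolina travel times from earlier in these notes and replaces the 60-minute value with 600, as in Section 1.3.3. The ratio variables are illustrative names introduced here.

```python
from statistics import mean, stdev

# Travel times (minutes) of the 15 North Carolina workers, in order
nc_times = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]

# Same data with the largest value changed from 60 to 600 minutes
nc_extreme = nc_times[:-1] + [600]

# How many times larger each statistic becomes after the change
mean_ratio = mean(nc_extreme) / mean(nc_times)
sd_ratio = stdev(nc_extreme) / stdev(nc_times)

print(f"mean grows by a factor of {mean_ratio:.1f}")
print(f"s_x grows by a factor of {sd_ratio:.1f}")
```

For this data set a single extreme value inflates s_x roughly tenfold, while the mean grows less than threefold.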
Check Your Understanding
The heights (in inches) of the five starters on a basketball team are 67, 72, 76, 76, and 84.
1. Find and interpret the mean.
2. Make a table that shows, for each value, its deviation from the mean and its squared deviation from the mean.
3. Show how to calculate the variance and standard deviation from the values in your table.
4. Interpret the meaning of the standard deviation in this setting.
1.3.9 Choosing Measures of Center and Spread
We now have a choice between two descriptions of the center and spread of a distribution: the median and IQR, or x̄ and s_x. Because x̄ and s_x are sensitive to extreme observations, they can be misleading when a distribution is strongly skewed or has outliers. In these cases, the median and IQR, which are both resistant to extreme values, provide a better summary. We'll see in the next chapter that the mean and standard deviation are the natural measures of center and spread for a very important class of symmetric distributions, the Normal distributions.
Choosing Measures of Center and Spread
The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with strong outliers. Use x̄ and s_x only for reasonably symmetric distributions that don't have outliers.
Remember that a graph gives the best overall picture of a distribution. Numerical measures of center and spread report specific facts about a distribution, but they do not describe its entire shape. Numerical summaries do not highlight the presence of multiple peaks or clusters, for example. Always plot your data.
Example – Who Texts More: Males or Females?
Pulling it all together
For their final project, a group of AP Statistics students investigated their belief that females text more than males. They asked a random sample of students from their school to record the number of text messages sent and received over a two-day period. Here are their data: