Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
1.3.1MeasuringCenter:TheMean Mean-Thearithmeticaverage.Tofindthemean (pronouncedxbar)ofasetofobservations,add theirvaluesanddividebythenumberofobservations.Ifthenobservationsarex1,x2,…,xn,theirmean is: Or Actually,thenotation referstothemeanofasample.Mostofthetime,thedatawe’llencounter canbethoughtofasasamplefromsomelargerpopulation.Whenweneedtoreferto apopulationmean,we’llusethesymbolμ(Greeklettermu,pronounced“mew”).Ifyouhavethe entirepopulationofdataavailable,thenyoucalculateμinjustthewayyou’dexpect:addthevaluesof alltheobservations,anddividebythenumberofobservations. Example–TravelTimestoWorkinNorthCarolina Calculatingthemean Belowisdataontraveltimesof15NorthCarolinaresidents. 1)Findthemeantraveltimeforall15workers 2)Calculatethemeanagain,thistimeexcludingthepersonwhoreporteda60-minutetraveltimeto work.Whatdoyounotice? Thepreviousexampleillustratesanimportantweaknessofthemeanasameasureofcenter:themean issensitivetotheinfluenceofextremeobservations.Thesemaybeoutliers,butaskeweddistribution thathasnooutlierswillalsopullthemeantowarditslongtail.Becausethemeancannotresistthe influenceofextremeobservations,wesaythatitisnotaresistantmeasureofcenter. ResistantMeasure-Astatisticthatisnotaffectedverymuchbyextremeobservations. 1.3.2MeasuringCenter:TheMedian Median-ThemedianMisthemidpointofadistribution,thenumbersuchthathalftheobservations aresmallerandtheotherhalfarelarger.Tofindthemedianofadistribution: 1. Arrangeallobservationsinorderofsize,fromsmallesttolargest. 2. Ifthenumberofobservationsnisodd,themedianMisthecenterobservationintheordered list. 3. Ifthenumberofobservationsniseven,themedianMistheaverageofthetwocenter observationsintheorderedlist. Example–TravelTimestoWorkinNorthCarolina Findingthemedianwhennisodd Whatisthemediantraveltimeforour15NorthCarolinaworkers?Herearethedataarrangedin order: 51010101012152020253030404060 Thecountofobservationsn=15isodd.Thebold20isthecenterobservationintheorderedlist,with 7observationstoitsleftand7toitsright.Thisisthemedian,M=20minutes. Example–StuckinTraffic Findingthemedianwhenniseven PeoplesaythatittakesalongtimetogettoworkinNewYorkStateduetotheheavytrafficnearbig cities.Whatdothedatasay?Herearethetraveltimesinminutesof20randomlychosenNewYork workers: 103052540201015302015208515651560604045 1.Makeastemplotofthedata.Besuretoincludeakey. 2.Findaninterpretthemedian. 1.3.3ComparingtheMeanandtheMedian OurdiscussionoftraveltimestoworkinNorthCarolinaillustratesanimportantdifferencebetween themeanandthemedian.Themediantraveltime(themidpointofthedistribution)is20minutes.The meantraveltimeishigher,22.5minutes.Themeanispulledtowardtherighttailofthisright-skewed distribution.Themedian,unlikethemean,isresistant.Ifthelongesttraveltimewere600minutes ratherthan60minutes,themeanwouldincreasetomorethan58minutesbutthemedianwouldnot changeatall.Theoutlierjustcountsasoneobservationabovethecenter,nomatterhowfarabovethe centeritlies.Themeanusestheactualvalueofeachobservationandsowillchaseasinglelarge observationupward. Themeanandmedianofaroughlysymmetricdistributionareclosetogether.Ifthedistributionis exactlysymmetric,themeanandmedianareexactlythesame.Inaskeweddistribution,themeanis usuallyfartheroutinthelongtailthanisthemedian. LeftSkewedDistributions RightSkewedDistribution CheckYourUnderstanding Questions1through4refertothefollowingsetting.Here,once again,isthestemplotoftraveltimestoworkfor20randomly selectedNewYorkers.Earlier,wefoundthatthemedianwas 22.5minutes. 1.Basedonlyonthestemplot,wouldyouexpectthemean traveltimetobelessthan,aboutthesameas,orlargerthanthe median?Why? 2.Useyourcalculatortofindthemeantraveltime.WasyouranswertoQuestion1correct? 3.InterpretyourresultfromQuestion2incontextwithoutusingthewords“mean”or“average.” 4.Wouldthemeanorthemedianbeamoreappropriatesummaryofthecenterofthisdistributionof drivetimes?Justifyyouranswer. 1.3.4MeasuringSpread:TheInterquartileRange(IQR) Ausefulnumericaldescriptionofadistributionrequiresbothameasureofcenterandameasureof spread. HowtoCalculateQuartilesQ1|M|Q3 1.ArrangetheobservationsinincreasingorderandlocatethemedianMintheorderedlistof observations. 2.ThefirstquartileQ1isthemedianoftheobservationswhosepositionintheorderedlististotheleft ofthemedian. 3.ThethirdquartileQ3isthemedianoftheobservationswhosepositionintheorderedlististothe rightofthemedian. InterquartileRange–IQR=Q3-Q1 Example–TravelTimestoWorkinNorthCarolina Calculatingquartiles OurNorthCarolinasampleof15workers’traveltimes,arrangedinincreasingorder,is Thereisanoddnumberofobservations,sothemedianisthemiddleone,thebold20inthelist.The firstquartileisthemedianofthe7observationstotheleftofthemedian.Thisisthe4thofthese7 observations,soQ1=10minutes(showninblue).Thethirdquartileisthemedianofthe7observations totherightofthemedian,Q3=30minutes(showningreen). Sothespreadofthemiddle50%ofthetraveltimesisIQR=Q3−Q1=30−10=20minutes.Besureto leaveouttheoverallmedianMwhenyoulocatethequartiles. Thequartilesandtheinterquartilerangeareresistantbecausetheyarenotaffectedbyafewextreme observations Example–StuckinTrafficAgain FindingandinterpretingtheIQR Findandinterprettheinterquartilerange(IQR). 1.3.5IdentifyingOutliers Inadditiontoservingasameasureofspread,theinterquartilerange(IQR)isusedaspartofaruleof thumbforidentifyingoutliers. 1.5*IQR–Callanobservationanoutlierifitfallsmorethan1.5xIQRabovethethirdquartileorbelow thefirstquartile Example–TravelTimestoworkinNewYork IdentifyingOutliersusingthe1.5*IQRrule Identifyanyoutliersinthedatafromthestemplot. Q1=15minutes Q3=42.5minutes IQR=27.5minutes Example–TravelTimestoWorkinNorthCarolina IdentifyingOutliers Determineifthetraveltimeof60minutesinthesampleof15NorthCarolinaworkersisanoutlier. Q1=10minutes Q3=30minutes IQR=20minutes 1.3.6TheFive-NumberSummaryandBoxplots Five-NumberSummary–Consistsofthesmallestobservation,thefirstquartile,themedian,thethird quartile,andthelargestobservation,writteninorderfromsmallesttolargest.Insymbols,thefivenumbersummaryis MinimumQ1MQ3Maximum Thesefivenumbersdivideeachdistributionroughlyintoquarters.About25%ofthedatavaluesfall betweentheminimumandQ1,about25%arebetweenQ1andthemedian,about25%arebetween themedianandQ3,andabout25%arebetweenQ3andthemaximum.Thefive-numbersummaryofa distributionleadstoanewgraph,theboxplot(akaboxandwhiskerplot). HowtoMakeaBoxplot 1.Acentralboxisdrawnfromthefirstquartile(Q1)tothethirdquartile(Q3). 2.Alineintheboxmarksthemedian. 3.Lines(calledwhiskers)extendfromtheboxouttothesmallestandlargestobservationsthatarenot outliers. Example–HomeRunKing MakingaBoxplot BarryBondssetthemajorleaguerecordbyhitting73homerunsinasingleseasonin2001.OnAugust 7,2007,Bondshithis756thcareerhomerun,whichbrokeHankAaron’slongstandingrecordof 755.Bytheendofthe2007seasonwhenBondsretired,hehadincreasedthetotalto762.Hereare dataonthenumberofhomerunsthatBondshitineachofhis21completeseasons: 162524193325344637334240373449734645452628 Makeaboxplotfortheabovedata,theinitialstepshavebeendonetosaveyoutime. CheckYourUnderstanding The2009rosteroftheDallasCowboysprofessionalfootballteamincluded10offensivelinemen.Their weights(inpounds)were 338318353313318326307317311311 1.Findthefive-numbersummaryforthesedatabyhand.Showyourwork. 2.CalculatetheIQR.Interpretthisvalueincontext. 3.Determinewhetherthereareanyoutliersusingthe1.5×IQRrule. 4.Drawaboxplotofthedata. 1.3.7MeasuringSpread:TheStandardDeviation Thefive-numbersummaryisnotthemostcommonnumericaldescriptionofadistribution.That distinctionbelongstothecombinationofthemeantomeasurecenterandthestandarddeviationto measurespread.Thestandarddeviationanditscloserelative,thevariance,measurespreadbylooking athowfartheobservationsarefromtheirmean.Let’sexplorethisideausingasimplesetofdata. Example–HowManyPets? Investigatingspreadaroundthemean Belowlistsdatadetailingthenumberofpetsownedby9children. 134445789 Themeannumberofpetsis5.Let’slookatwheretheobservationsinthedatasetarerelativetothe mean. Thefigureabovedisplaysthedatainadotplot,withthemeanclearlymarked.Thedatavalue1is4 unitsbelowthemean.Wesaythatitsdeviationfromthemeanis−4.Whataboutthedatavalue7?Its deviationis7−5=2(itis2unitsabovethemean).Thearrowsinthefiguremarkthesetwodeviations fromthemean.Thedeviationsshowhowmuchthedatavaryabouttheirmean.Theyarethestarting pointforcalculatingthevarianceandstandarddeviation. Thetabletotheleftshowsthedeviationfromthe mean foreachvalueinthedataset.Sum thedeviationsfromthemean.Youshouldget 0,becausethemeanisthebalancepointofthe distribution.Sincethesumofthedeviationsfromthe meanwillbe0foranysetofdata,weneedanother waytocalculatespreadaroundthemean.Howcanwe fixtheproblemofthepositiveandnegativedeviations cancelingout?Wecouldtaketheabsolutevalueof eachdeviation.Orwecouldsquarethedeviations.For mathematicalreasonsbeyondthescopeofthisbook, statisticianschoosetosquareratherthantouse absolutevalues. Wehaveaddedacolumntothetablethatshowsthe squareofeachdeviation .Addupthe squareddeviations.Didyouget52?Nowwecompute theaveragesquareddeviation—sortof.Insteadof dividingbythenumberofobservationsn,wedivide byn−1: Variance- Thevalue6.5iscalledthevariance. The average squared distance of the observations in a data set from their mean. In symbols, Becausewesquaredallthedeviations,ourunitsarein“squaredpets.”That’snogood.We’lltakethe squareroottogetbacktothecorrectunits—pets.Theresultingvalueisthestandarddeviation: This2.55isroughlytheaveragedistanceofthevaluesinthedatasetfromthemean. StandardDeviation-Thestandarddeviationsxmeasurestheaveragedistanceoftheobservations fromtheirmean.Itiscalculatedbyfindinganaverageofthesquareddistancesandthentakingthe squareroot.Thisaveragesquareddistanceiscalledthevariance.Insymbols,thevariance isgiven by HowtoFindtheStandardDeviation 1. Findthedistanceofeachobservationfromthemeanandsquareeachofthesedistances. 2. Averagethedistancesbydividingtheirsumbyn−1. 3. Thestandarddeviationsxisthesquarerootofthisaveragesquareddistance: Manycalculatorsreporttwostandarddeviations,givingyouachoiceofdividingbynorbyn−1.The formerisusuallylabeledσx,thesymbolforthestandarddeviationofapopulation.Ifyourdataset consistsoftheentirepopulation,thenit’sappropriatetouseσx.Moreoften,thedatawe’reexamining comefromasample.Inthatcase,weshouldusesx. Moreimportantthanthedetailsofcalculatingsxarethepropertiesthatdeterminetheusefulnessof thestandarddeviation: • sxmeasuresspreadaboutthemeanandshouldbeusedonlywhenthemeanischosenasthe measureofcenter. • sxisalwaysgreaterthanorequalto0.sx=0onlywhenthereisnovariability.Thishappensonly whenallobservationshavethesamevalue.Otherwise,sx>0.Astheobservationsbecome morespreadoutabouttheirmean,sxgetslarger. • sxhasthesameunitsofmeasurementastheoriginalobservations.Forexample,ifyoumeasure metabolicratesincalories,boththemeanXandthestandarddeviationsxarealsoin • calories.Thisisonereasontoprefersxtothevariance ,whichisinsquaredcalories. LikethemeanX,sxisnotresistant.Afewoutlierscanmakesxverylarge. TheuseofsquareddeviationsmakessxevenmoresensitivethanXtoafewextremeobservations. CheckYourUnderstanding Theheights(ininches)ofthefivestartersonabasketballteamare67,72,76,76,and84. 1.Findandinterpretthemean. 2.Makeatablethatshows,foreachvalue,itsdeviationfromthemeananditssquareddeviationfrom themean. 3.Showhowtocalculatethevarianceandstandarddeviationfromthevaluesinyourtable. 4.Interpretthemeaningofthestandarddeviationinthissetting. 1.3.9ChoosingMeasureofCenterandSpread Wenowhaveachoicebetweentwodescriptionsofthecenterandspreadofadistribution:the medianandIQR,orXandsx.BecauseXandsxaresensitivetoextremeobservations,theycanbe misleadingwhenadistributionisstronglyskewedorhasoutliers.Inthesecases,themedian andIQR,whicharebothresistanttoextremevalues,provideabettersummary.We’llseeinthenext chapterthatthemeanandstandarddeviationarethenaturalmeasuresofcenterandspreadforavery importantclassofsymmetricdistributions,theNormaldistributions. ChoosingMeasuresofCenterandSpread ThemedianandIQRareusuallybetterthanthemeanandstandarddeviationfordescribingaskewed distributionoradistributionwithstrongoutliers.UseXandsxonlyforreasonablysymmetric distributionsthatdon’thaveoutliers. Rememberthatagraphgivesthebestoverallpictureofadistribution.Numericalmeasuresofcenter andspreadreportspecificfactsaboutadistribution,buttheydonotdescribeitsentire shape.Numericalsummariesdonothighlightthepresenceofmultiplepeaksorclusters,forexample. Alwaysplotyourdata. Example-WhoTextsMore—MalesorFemales? Pullingitalltogether Fortheirfinalproject,agroupofAPStatisticsstudentsinvestigatedtheirbeliefthatfemalestextmore thanmales.Theyaskedarandomsampleofstudentsfromtheirschooltorecordthenumberoftext messagessentandreceivedoveratwo-dayperiod.Herearetheirdata: