STAB22 Statistics I
Lecture 4

Spread of Distribution
* Spread describes how much data vary about their centre (a.k.a. dispersion or variability)
* Spread is an important aspect of a distribution
* E.g. Investments below have same mean return (3%) but different spread; which do you consider safer?
[Figure: histograms of Investment 1 and Investment 2; x-axis: % return (−10 to 15), y-axis: Frequency]

Range
* Range = Maximum − Minimum value
* Simple to calculate, but not always helpful
* E.g. Data: 4, 6, 8, 8, 10, 15 → Range = 15 − 4 = 11
* Data set 1: 4, 4, 4, 4, 4, 4, 10
  Data set 2: 4, 5, 6, 7, 8, 9, 10
  Same range, but not same spread!
* Range is very sensitive to outliers
* Depends only on extreme values (min & max)
* Better alternative is Interquartile Range

Quartiles
* Quartiles: 3 values that divide distribution into 4 parts, each containing ¼ of data
* 1st quartile (Q1) has ¼ of data below & ¾ above
* 2nd quartile (Q2) has ½ of data below & ½ above; Q2 = ?
* 3rd quartile (Q3) has ¾ of data below & ¼ above
[Diagram: distribution divided at Q1, Q2 and Q3 into four parts, each holding ¼ of data]

Quartiles
* To find Q1 & Q3, split data in two halves & calculate median of each half
* When # data is odd (n = 5, 7, …) include median value in each half
* E.g. Find Q1 & Q3 for following data sets:
  (1 3 4 6 7 9)
  (1 3 4 4 6 7 9)

Interquartile Range
* IQR (Interquartile Range) = Q3 − Q1
* Distance between Q1 and Q3
* Resistant (not sensitive) to outliers
[Diagram: IQR marked as the distance from Q1 to Q3]
* E.g. Data 3, 5, 7, 7, 8 → IQR = ?

Variance
* Variance (s²): measures average squared deviation of individual data from their mean
* For values x1, x2, …, xn, variance is given by:

\[
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}
    = \frac{\sum (x_i - \bar{x})^2}{n - 1}
\]

* Units of variance are squared → measuring spread in $² or in days²!
* Easy fix: take square root (√) of s² to correct units

Standard Deviation
* Standard Deviation (s):

\[
s = \sqrt{s^2}, \qquad \text{Standard Deviation} = \sqrt{\text{Variance}}
\]

* Both variance (s²) & st. dev. (s) measure spread around mean (x̄)
* They are sensitive to outliers; they work best for symmetric distributions
* Nevertheless, they are commonly used in Stats because they have useful properties

Example
* Find standard deviation of data 2, 3, 5, 6 & 9.

  Data (xi)    Deviation (xi − x̄)    Deviation² (xi − x̄)²
  2
  3
  5
  6
  9
  Total 25

Five-Number Summary
* Five-number summary: set of five measures giving quick summary of a distribution
* Measures consist of: (1) Minimum (2) Q1 (3) Median (4) Q3 and (5) Maximum
* Gives an idea of both center & spread
* E.g. For data (10, 11, 14, 15, 17, 19, 21, 28, 35), 5-number summary is:

  Min    Q1    Median    Q3    Max
  10     14    17        21    35

StatCrunch: Stat > Summary Stats

Boxplot
* Boxplot: visual display of 5-number summary
* E.g. Cereal sodium content
[Figure: boxplot of sodium content with Maximum, Q3, Median, Q1 and Minimum labelled; the box spans the IQR]

  Summary statistics:
  Column    Min    Q1     Median    Q3     Max
  sodium    0      130    180       210    320

StatCrunch: Graphics > Boxplot

Boxplot & Outliers
* Boxplot helps identify outliers using “fences”
* Let:

\[
\text{Upper fence} = Q_3 + 1.5 \times \text{IQR}, \qquad
\text{Lower fence} = Q_1 - 1.5 \times \text{IQR}
\]

* Draw lines from box to farthest value within fences
* Plot values outside fences individually
* These are suspect outliers that require examination
* E.g. Cereal sodium content: Q1 = 130, Q3 = 210, IQR = 80
[Figure: boxplot of sodium content with upper fence, lower fence and a suspect outlier marked]

StatCrunch: Graphics > Boxplot (tick “use fences to identify outliers”)

Example
* Cereal calorie content

  Summary statistics:
  Column      Min    Q1     Median    Q3     Max
  calories    50     100    110       110    160

Choosing a Summary
* Mean and standard deviation work well for reasonably symmetric distributions w/o outliers
* Five-number summary is better for describing skewed distributions w/ outliers
* In addition to numerical summaries, always try to include a plot of the distribution (e.g. boxplot or histogram)
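
The standard deviation example (data 2, 3, 5, 6 & 9) can be worked through by hand using the variance formula from the Variance slide. The steps below are one way to lay out the arithmetic; they are a worked sketch, not a reproduction of how the table was filled in during the lecture.

```latex
\begin{align*}
\bar{x} &= \frac{2 + 3 + 5 + 6 + 9}{5} = \frac{25}{5} = 5 \\[4pt]
x_i - \bar{x} &:\; -3,\; -2,\; 0,\; 1,\; 4
  \qquad \text{(deviations sum to } 0\text{)} \\[4pt]
(x_i - \bar{x})^2 &:\; 9,\; 4,\; 0,\; 1,\; 16
  \qquad \text{(total } = 30\text{)} \\[4pt]
s^2 &= \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{30}{5 - 1} = 7.5 \\[4pt]
s &= \sqrt{s^2} = \sqrt{7.5} \approx 2.74
\end{align*}
```

Note that the divisor is n − 1 = 4, not n = 5, and that the deviations always add up to zero, which is a quick check on the middle column.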
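
The measures of spread in this lecture can also be checked with a few lines of Python. The sketch below follows the rules as stated on the slides (split the sorted data in two halves, include the median in both halves when n is odd, divide by n − 1 for the variance, fences at 1.5 × IQR). The function names are illustrative assumptions, not part of StatCrunch or of the course materials, and other software may use a different quartile convention and so report slightly different Q1 and Q3.

```python
import math
from statistics import median


def data_range(data):
    """Range = maximum - minimum."""
    return max(data) - min(data)


def quartiles(data):
    """Q1 and Q3 by the split-halves rule from the slides.

    When n is odd, the median value is included in both halves.
    """
    xs = sorted(data)
    n = len(xs)
    half = (n + 1) // 2  # size of each half (rounds up when n is odd)
    return median(xs[:half]), median(xs[n - half:])


def five_number_summary(data):
    """(Min, Q1, Median, Q3, Max)."""
    q1, q3 = quartiles(data)
    return min(data), q1, median(data), q3, max(data)


def sample_variance(data):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(data)
    xbar = sum(data) / n
    return sum((x - xbar) ** 2 for x in data) / (n - 1)


def fences(q1, q3):
    """(lower fence, upper fence) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


print(data_range([4, 6, 8, 8, 10, 15]))            # 11
print(quartiles([1, 3, 4, 6, 7, 9]))               # (3, 7)
print(quartiles([1, 3, 4, 4, 6, 7, 9]))            # (3.5, 6.5)
print(five_number_summary([10, 11, 14, 15, 17, 19, 21, 28, 35]))
# (10, 14, 17, 21, 35)

variance = sample_variance([2, 3, 5, 6, 9])
print(variance, round(math.sqrt(variance), 2))     # 7.5 2.74

print(fences(130, 210))                            # (10.0, 330.0)
```

For the cereal sodium data the fences come out at 10 and 330, so the minimum value of 0 falls below the lower fence and would be plotted individually as a suspect outlier, while the maximum of 320 lies inside the fences.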