Describing Distributions
• Measures of location
• Measures of spread
• Unusual observations
• Robust measures
• Shape of the distribution

Measures of centre
• Mean
• Median

Median
• The median M is the midpoint of a distribution: the number such that half the observations are smaller and the other half are larger.
• Procedure: order the data, then count in to the middle observation. If n is even, average the two middle observations.

Median - examples
Data: 39 41 38 42 340 (n = 5)
Order: 38 39 41 42 340
The median is the 3rd number in the list: 41

Data: 39 41 38 42 (n = 4)
Order: 38 39 41 42
The median is the average of the 2nd and 3rd numbers in the list: 40

Mean
• If n observations are denoted by x1, x2, x3, ..., xn, their sample mean is
  $\bar{x} = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$

Mean - examples
• Sample mean of 39, 41, 38, 42 is (39 + 41 + 38 + 42)/4 = 40; median = 40
• Sample mean of 39, 41, 38, 42, 340 is (39 + 41 + 38 + 42 + 340)/5 = 100; median = 41

Measures of spread
• Sample variance and sample standard deviation
• First and third quartiles and interquartile range
• Range

Quartiles
• Q1, the first quartile: the median of the observations less than the median
• Q3, the third quartile: the median of the observations greater than the median

Example of Quartiles
Data: 1 2 3 4 5 6 7 8 9 10 11 12 13 140 200
Median = 8
Q1 = 4
Q3 = 12

Interquartile Range (IQR)
IQR = Q3 - Q1
Previous example: IQR = 12 - 4 = 8

Sample Variance
• The sample variance is the average squared deviation from the mean, computed with divisor n - 1. The sample variance of n observations x1, x2, x3, ..., xn is
  $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

Sample standard deviation
• The sample standard deviation s is the square root of the sample variance
• It has the same unit of measurement as the original observations

Variance and standard deviation
Data   Deviation from mean   Square of deviation
38     -2                    4
39     -1                    1
41      1                    1
42      2                    4
Sum                          10
Mean = 40, Variance = 10/3 = 3.33, Std dev = 1.83

Variance and standard deviation
Data   Deviation from mean   Square of deviation
38     -62                   3844
39     -61                   3721
41     -59                   3481
42     -58                   3364
340    240                   57600
Sum                          72010
Mean = 100, Variance = 72010/4 = 18002.50, Std dev = 134.17

Example: Find the sample mean and sample standard deviation for the second set of rainfall measurements.
xi     xi - x̄    (xi - x̄)^2
31     -3        9
35      1        1
36      2        4
30     -4        16
37      3        9
35      1        1
Sum     0        40

x̄ = 204/6 = 34, s^2 = 40/(6 - 1) = 8, s = 2.8

As we suspected, the variability of the second set of rainfall figures is lower than that of the first.

The 5-number Summary
Minimum, Q1, Median, Q3, Maximum
Represent this graphically by a boxplot.

Boxplot
• A central box spans the quartiles
• A line in the box marks the median
• Observations more than 1.5 × IQR outside the central box are plotted individually as possible outliers
• Lines extend from the box out to the smallest and largest observations that are not suspected outliers

Example of a Boxplot: The Pulse Data
[Figure: boxplot of Pulse1, vertical axis from 50 to 100]

Outliers on boxplot
• Inner fences: Q1 - 1.5 IQR and Q3 + 1.5 IQR
• Outer fences: Q1 - 3 IQR and Q3 + 3 IQR
• Plot values between the inner and outer fences as * (possible outliers)
• Plot values outside the outer fences with a separate symbol (probable outliers)

Robust (resistant) statistic
• An outlier is a value outside the usual range
• A robust (resistant) statistic is not much affected by outliers
• Unaffected: median, quartiles, IQR
• Affected: mean
• Most affected: standard deviation

Example: a week's travelling times on the Met
TIMES    36 29 29 35 184 30 34 31 30 34
TIMES2   29 36 29 35  40 30 34 31 30 34

Variable   Mean   Median   Std dev   Q1      Q3
TIMES      47.2   32.5     48.1      29.8    35.2
TIMES2     32.8   32.50    3.61      29.75   35.25

The Pulse Data
Students in an introductory statistics course participated in a simple experiment. Each student recorded his or her height, weight, gender, smoking preference, usual activity level, and resting pulse. Then they all flipped coins, and those whose coins came up heads ran in place for one minute.
Then the entire class recorded their pulses once more.

Column   Name       Description
A        Pulse1     First pulse rate
B        Pulse2     Second pulse rate
C        Ran        1 = ran in place, 2 = did not run in place
D        Smokes     1 = smokes regularly, 2 = does not smoke regularly
E        Sex        1 = male, 2 = female
F        Height     Height in inches
G        Weight     Weight in pounds
H        Activity   Usual level of physical activity: 1 = slight, 2 = moderate, 3 = a lot

Comparing distributions with boxplots
What is the difference between males and females?
First, find the five-number summary of Pulse1 for males and females separately by sorting the Pulse1 data by Sex (1 = male, 2 = female).

Sex          Min   Q1   M    Q3   Max
1 = male     48    63   70   75   92
2 = female   58    66   78   86   100

Using these five-number summaries, we can easily construct side-by-side comparative boxplots.
[Figure: side-by-side boxplots of Pulse1 by Sex (1, 2), vertical axis from 50 to 100]
Note how easy it is to see the higher median and the greater spread for females.

Let's compare the pulse rates of those who ran and those who didn't. This time we use the Pulse2 measurements and sort them by the value of Ran.
[Figure: side-by-side boxplots of Pulse2 by Ran (1, 2), vertical axis from 50 to 140]
Note: not only is the median higher among those who ran, but the spread is also much greater. Why?

STUDENT EXAMPLES
Example 1: Produce a boxplot for the following data:
27, 4, 13, 12, 35, 19, 33, 26, 35, 3, 41, 31, 42
Order: 3 4 12 13 19 26 27 31 33 35 35 41 42
Q1 = 12.5, M = 27, Q3 = 35
Outliers: IQR = 35 - 12.5 = 22.5, so 1.5 × IQR = 33.75; 12.5 - 33.75 = -21.25 and 35 + 33.75 = 68.75
No observations lie outside [-21.25, 68.75], so there are no suspected outliers.
[Figure: boxplot of the data (column C9), vertical axis from 0 to 40]

What should one do about outliers?
Example: Newcomb's measurements of the passage time of light.

What variable is being measured?
Newcomb measured how long light took to travel from his laboratory on the Potomac River to a mirror at the base of the Washington Monument and back, a total distance of about 7400 metres. Newcomb computed the speed of light from the travel time.

What are the units of measurement?
Newcomb's first measurement of the passage time of light was 0.000024828 second, so his unit of measurement was the second.

How are the data recorded?
The entries in the table look nothing like 0.000024828. Such numbers are awkward to write and to do arithmetic with. We therefore move the decimal point nine places to the right, giving 24828, and then record only the deviation from 24800. The table entry 28 is short for the original 0.000024828, and the entry -2 stands for 0.000024798. This is called coding the data.

Time (coded values, sorted down the columns)
-44  23  25  27  28  31  36
 -2  23  25  27  29  32  36
 16  23  25  27  29  32  36
 16  24  26  27  29  32  37
 19  24  26  28  29  32  39
 20  24  26  28  29  32  40
 21  24  26  28  30  33
 21  24  26  28  30  33
 22  25  27  28  30  34
 22  25  27  28  31  36

[Figure: boxplot of Time, vertical axis from -50 to 40]

Newcomb decided to keep the -2 in his estimate of the speed of light but to remove the -44. In fact both values are considered outliers by the definitions above. Further, we expect "errors" to be symmetric, and they are, once both values are excluded.

Linear transformations of a variable
Example: The mean maximum daily temperature in March is 25 ºC with standard deviation 3 ºC.
ºF = 32 + 1.8 × ºC
Hence the mean maximum daily temperature in March is 32 + 1.8 × 25 = 77 ºF, with standard deviation 1.8 × 3 = 5.4 ºF.

Linear transformations of a variable
Example: Distances and costs for taxis
Flagfall $2 and $1.50 per km, so Cost = 2 + 1.5 × Distance

Distances (km)   8    10   12   13     15
Costs ($)        14   17   20   21.5   24.5

Mean distance: 11.6 km
Mean cost: $(2 + 1.5 × 11.6) = $19.40

Linear transformation of a variable
If y = a + bx then
• $\bar{y} = a + b\bar{x}$
• $s_y = |b| s_x$
• $M_y = a + b M_x$ (M is the median)
• If b > 0, $(Q_1)_y = a + b(Q_1)_x$ and $(Q_3)_y = a + b(Q_3)_x$
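
The rules above can be checked numerically on the taxi example. The following is a minimal sketch, not part of the original slides; it assumes Python's standard `statistics` module, and the variable names (`distances`, `costs`, `a`, `b`) are chosen here purely for illustration.

```python
import math
import statistics

# Taxi distances (km) from the example above.
distances = [8, 10, 12, 13, 15]

# Fare rule from the example: $2 flagfall plus $1.50 per km, i.e. y = a + b*x.
a, b = 2.0, 1.5
costs = [a + b * x for x in distances]

# Summary statistics of the original variable x (distance).
mean_x = statistics.mean(distances)
sd_x = statistics.stdev(distances)      # sample standard deviation (divisor n - 1)
med_x = statistics.median(distances)

# Summary statistics of the transformed variable y (cost), computed directly.
mean_y = statistics.mean(costs)
sd_y = statistics.stdev(costs)
med_y = statistics.median(costs)

# Compare the direct results with the values predicted by the rules.
print("mean:   ", mean_y, "predicted", a + b * mean_x)
print("std dev:", sd_y, "predicted", abs(b) * sd_x)
print("median: ", med_y, "predicted", a + b * med_x)
```

The flagfall a shifts the mean and the median but leaves the standard deviation unchanged, because adding a constant to every observation does not alter the deviations from the mean.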