1.2 Describing distributions with numbers (p28)

Measuring center: the mean

Two common measures of center are the mean and the median. The two measures behave differently.

Example
Find the mean of the following observations: 4, 6, 9, 3, 6

Solution:
$$\text{mean} = \frac{4 + 6 + 9 + 3 + 6}{5} = \frac{28}{5} = 5.6$$

If there are $n$ observations $x_1, x_2, \ldots, x_n$ in a sample, the sample mean (denoted by $\bar{x}$) is given by
$$\bar{x} = \frac{\text{sum of the } x_i\text{'s}}{n} = \frac{\sum x_i}{n}.$$

Example
The annual salaries (in thousands) of a random sample of five employees of a company are: 40, 30, 25, 200, 28
$$\text{mean} = \frac{40 + 30 + 25 + 200 + 28}{5} = \frac{323}{5} = 64.6$$
If we exclude 200 as an outlier,
$$\text{mean} = \frac{40 + 30 + 25 + 28}{4} = \frac{123}{4} = 30.75$$

The mean is sensitive to extreme observations: a single extreme value can pull it far away from the bulk of the data. The mean is NOT a resistant measure of center. (p31)

Measuring center: the median (p31)

The median ($M$) is the midpoint of the distribution, the number such that half the observations are smaller and the other half are larger. To find the median of a distribution:
1. Arrange all observations in order of size, from smallest to largest.
2. If the number of observations is odd, the median is the center observation in the ordered list.
3. If the number of observations is even, the median is the average of the two center observations in the ordered list.

Example
The annual salaries (in thousands) of a random sample of five employees of a company are: 40, 30, 25, 200, 28
Arranging the values in increasing order: 25 28 30 40 200
median = 30
Excluding 200, median = (28 + 30)/2 = 29.
Note that excluding 200 changes the mean of this data set from 64.6 to 30.75 but moves the median only from 30 to 29: the influence of the extreme value 200 on the median is much smaller.

StatCrunch commands: Stat > Summary Statistics
StatCrunch output for the salary data in the example above:
Summary statistics:
Column   n   Mean   Median
salary   5   64.6   30

Mean versus median (p32)

The median and mean are the most common measures of the center of a distribution. If the distribution is exactly symmetric, the mean and median are exactly the same.
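
As a quick check of the salary example, here is a minimal Python sketch. Python is an assumption on our part (the notes themselves use StatCrunch), and the variable names are illustrative only. It uses the standard-library `statistics` module to compare how the mean and the median react when the outlier 200 is dropped.

```python
from statistics import mean, median

# Annual salaries (in thousands) from the example in the notes.
salaries = [40, 30, 25, 200, 28]

# The same sample with the extreme value 200 excluded.
without_outlier = [x for x in salaries if x != 200]

mean_all = mean(salaries)
median_all = median(salaries)
mean_trimmed = mean(without_outlier)
median_trimmed = median(without_outlier)

# The mean shifts a lot when 200 is removed; the median barely moves.
print(f"mean:   {mean_all} -> {mean_trimmed}")
print(f"median: {median_all} -> {median_trimmed}")
```

The size of the two shifts is what "resistant" refers to: the median depends only on the middle of the ordered list, so one extreme value has little effect on it.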
The median is less influenced by extreme values than the mean.
If the distribution is skewed to the right, typically mode < median < mean.
If the distribution is skewed to the left, typically mean < median < mode.

Examples

The distribution of CO2 (Table 1.3, p26)
Variable: CO2
Decimal point is 1 digit(s) to the right of the colon.
0 : 00000000001111111122233344444
0 : 555677888999
1 : 0001
1 : 67
2 : 0
Summary statistics:
Column   n    Mean        Median   Min   Max
CO2      48   4.5958333   3.2      0     19.9

Distribution of a simulated data set (100 values)
Variable: x
Decimal point is 1 digit(s) to the right of the colon.
0 : 7
1 :
1 :
2 : 1
2 :
3 :
3 :
4 : 1
4 : 5
5 : 012344
5 : 88
6 : 4
6 : 55689
7 : 1344
7 : 5567889
8 : 00011112223
8 : 555566666678888899
9 : 000001112333334
9 : 555555666778888888889999999
Summary statistics:
Column   n     Mean       Median    Min        Max
x        100   82.23267   86.8068   7.228082   99.321556

Questions
1. You are asked to recommend a measure of center to characterize the following data: 0.6, 0.2, 0.1, 0.2, 0.2, 0.3, 0.7, 0.1, 0.0, 22.5, 0.4. What is your recommendation and why?
2. The mean is ____ sensitive to extreme values than the median. a) more b) less c) equally d) can't say without data
3. Changing the value of a single score in a data set will necessarily cause the mean to change. (T/F)
4. Changing the value of a single score in a data set will necessarily cause the median to change. (T/F)

Measuring spread

The range (max - min) is a measure of spread, but it is very sensitive to the influence of extreme values.

Measuring spread: interquartile range (IQR) (p38)

Quartiles (p33):
The first quartile ($Q_1$) is the median of the observations whose position in the ordered list is to the left of the overall median.
The third quartile ($Q_3$) is the median of the observations whose position in the ordered list is to the right of the overall median.
The interquartile range is the distance between the first and third quartiles, i.e.
$$\mathrm{IQR} = Q_3 - Q_1$$

Example
The highway mileages of 18 cars, arranged in increasing order, are:
13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30
$n = 18$ ($n$ is even), $\frac{n+1}{2} = \frac{18+1}{2} = 9.5$, and so the median is the average of the 9th and 10th values in the ordered data set: $\frac{24 + 26}{2} = 25$.
$Q_1$ = 5th value = 21.
$Q_3$ = 5th value from the upper end = 27.
$\mathrm{IQR} = Q_3 - Q_1 = 27 - 21 = 6$.

The five-number summary (p36)

The five-number summary of a set of observations consists of the minimum, the first quartile, the median, the third quartile and the maximum. These five numbers give a quick summary of both the center and the spread of the distribution.

StatCrunch commands: Stat > Summary Statistics

Example
The highway mileages of 18 cars, arranged in increasing order, are:
13 13 16 19 21 21 23 23 24 26 26 27 27 27 28 28 30 30
Give the five-number summary.
Ans: min = 13, first quartile = 21, median = 25, third quartile = 27, max = 30.

The StatCrunch output using the above commands is as follows:
Summary statistics:
Column    n    Mean        Median   Min   Max   Q1   Q3
mileage   18   23.444445   25       13    30    21   27

Boxplot (p36)

A boxplot is a graph of the five-number summary.
Example: Make a boxplot for the data in the above example.
StatCrunch commands: Graphics > Boxplot

1.5 × IQR rule for outliers (p37)

An observation is flagged as a suspected outlier if it falls more than 1.5 × IQR above the third quartile or below the first quartile.

Strength: 0, 0, 550, 750, 950, 950, 1150, 1150, 1150, 1150, 1150, 1250, 1250, 1350, 1450, 1450, 1450, 1550, 1550, 1550, 1850, 2050, 3150
Summary statistics:
Column     n    Mean        Median   Min   Max    Q1    Q3     Range
strength   23   1254.3478   1250     0     3150   950   1550   3150
IQR = 1550 - 950 = 600
1.5 × IQR = 900
Q3 + 1.5 × IQR = 1550 + 900 = 2450
Q1 - 1.5 × IQR = 950 - 900 = 50
The two observations of 0 fall below 50 and the observation 3150 falls above 2450, so these three values are flagged as suspected outliers.

Side-by-side boxplots for comparison

Example
Consider Ex. 1.41 (p26).
Summary statistics for StudyTime:
Group by: Gender
Gender   n    Mean         Min   Max   Q1    Q3    IQR
F        30   165.16667    60    360   120   180   60
M        30   117.166664   0     300   60    150   90

Measuring spread: standard deviation (p39)

The variance ($s^2$) of a set of $n$ observations $x_1, x_2, \ldots, x_n$ is
$$s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}.$$
The standard deviation ($s$) is the square root of the variance $s^2$, i.e.
$$s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}.$$

Example
Find the standard deviation of the following data set: 5, 8, 2
$n = 3$, mean $\bar{x} = \frac{5 + 8 + 2}{3} = \frac{15}{3} = 5$
$$s^2 = \frac{(5 - 5)^2 + (8 - 5)^2 + (2 - 5)^2}{3 - 1} = \frac{18}{2} = 9$$
$$s = \sqrt{9} = 3.$$

Example
Consider Ex. 1.41 (p26) again.
Summary statistics for StudyTime:
Group by: Gender
Gender   n    Mean         Variance    Std. Dev.   Median   Range   Min   Max   Q1    Q3    IQR
F        30   165.16667    3193.9368   56.514927   175      300     60    360   120   180   60
M        30   117.166664   5511.523    74.23963    120      300     0     300   60    150   90

Properties of the standard deviation $s$ (p41)
• $s$ measures the spread about the mean $\bar{x}$.
• $s = 0$ only when there is no spread. This happens only when all observations have the same value.
• $s$, like the mean $\bar{x}$, is not resistant. A few outliers can make $s$ very large.

Choosing a summary (p42)

The five-number summary is usually better than the mean and the standard deviation for describing skewed distributions or distributions with strong outliers. Use the mean and standard deviation for reasonably symmetric distributions that are free of outliers.

Effect of a linear transformation (p43)
• Multiplying each observation in a data set by a number $b$ multiplies the measures of center (mean, median) by $b$ and the measures of spread (standard deviation, IQR) by $|b|$.
• Adding the same number $a$ to each observation in a data set adds $a$ to measures of center, quartiles and percentiles, but does not change the measures of spread.
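
The two linear-transformation rules can be checked numerically. The sketch below is in Python, which is an assumption on our part (the notes use StatCrunch). The helper names `quartiles` and `summary` and the constants `a = 10`, `b = -2` are hypothetical, chosen only for illustration. Quartiles are computed as medians of the lower and upper halves of the ordered data, as defined in the notes; other software may use a different quartile convention and give slightly different values.

```python
import math
from statistics import mean, median, stdev


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumes at least two observations. When n is odd, the overall
    median is left out of both halves.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    return median(ordered[:half]), median(ordered[-half:])


def summary(values):
    """Collect the measures of center and spread discussed in the notes."""
    q1, q3 = quartiles(values)
    return {
        "mean": mean(values),
        "median": median(values),
        "iqr": q3 - q1,
        "sd": stdev(values),  # sample standard deviation (divides by n - 1)
    }


# Highway mileages of the 18 cars from the example in the notes.
mileage = [13, 13, 16, 19, 21, 21, 23, 23, 24,
           26, 26, 27, 27, 27, 28, 28, 30, 30]

# Hypothetical constants for the transformation y = a + b*x.
# b is negative on purpose, to show why spread scales by abs(b).
a, b = 10, -2
transformed = [a + b * x for x in mileage]

before = summary(mileage)
after = summary(transformed)

# Center follows a + b*(old value); spread follows abs(b)*(old value).
checks = {
    "mean": math.isclose(after["mean"], a + b * before["mean"]),
    "median": math.isclose(after["median"], a + b * before["median"]),
    "sd": math.isclose(after["sd"], abs(b) * before["sd"]),
    "iqr": math.isclose(after["iqr"], abs(b) * before["iqr"]),
}
for name, ok in checks.items():
    print(f"{name}: rule holds = {ok}")
```

With a negative `b` the order of the data is reversed, so the first quartile of the transformed data comes from the third quartile of the original data. The IQR is still multiplied by `abs(b)`, which is why the rule for spread is stated with the absolute value.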