Summarizing Data

Graphical Methods

Histogram
[Figure: histogram; vertical axis 0 to 8; horizontal axis classes 70 to 80, 80 to 90, 90 to 100, 100 to 110, 110 to 120, 120 to 130.]

Grouped Freq Table

Class         Verbal IQ   Math IQ
70 to 80          1          1
80 to 90          6          2
90 to 100         7         11
100 to 110        6          4
110 to 120        3          4
120 to 130        0          1

Stem-Leaf Diagram

 8 | 024669
 9 | 04455699
10 | 224559
11 | 189
12 |

Box-whisker Plot

Example
• The Baby Boom

Age Distribution for Canada 1921 - 2006
[Figure: one histogram per census year (1921, 1931, 1941, 1951, 1956, 1961, 1966, 1971, 1976, 1981, 1986, 1991, 1996, 2001, 2006). Vertical axis: 0.0% to 20.0% in steps of 2.0%. Horizontal axis: age groups Under 5, 5 to 9, 10 to 14, 15 to 24, 25 to 34, 35 to 44, 45 to 54, 55 to 64, 65 to 74, 75 to 84, 85 and over.]

Median Age in Canada by Gender and Year

Year    Male   Female
1921    24.7    23.2
1931    25.5    24.0
1941    27.5    26.6
1951    27.8    27.6
1956    27.2    27.3
1961    26.1    26.6
1966    25.0    25.9
1971    25.7    26.7
1976    27.2    28.4
1981    29.0    30.3
1986    30.9    32.4
1991    32.7    34.2
1996    34.5    36.1
2001    36.8    38.4
2006    38.6    40.4

[Figure: Median Age in Canada by Gender; series Male and Female; vertical axis 10.0 to 40.0; horizontal axis 1920 to 2000.]

Total Population in Canada by Year

Year    Total Population
1921        8,787,949
1931       10,376,786
1941       11,506,655
1951       14,009,429
1956       16,080,791
1961       18,238,247
1966       20,014,880
1971       21,568,310
1976       22,992,600
1981       24,343,180
1986       25,309,330
1991       27,296,855
1996       28,846,760
2001       30,007,095
2006       31,612,895

[Figure: Total Population (Canada); vertical axis 0 to 35; horizontal axis 1920 to 2010.]

Summary Numerical Measures

Measure of Central Location
1. Mean
2. Median

Measure of Non-Central Location
1. Percentiles
2. Quartiles
   1. Lower quartile (Q1) (25th percentile) (lower mid-hinge)
   2. Median (Q2) (50th percentile) (hinge)
   3. Upper quartile (Q3) (75th percentile) (upper mid-hinge)

Measure of Variability (Dispersion, Spread)
1. Range
2. Inter-Quartile Range
3. Variance, standard deviation
4. Pseudo-standard deviation

1. Range
R = Range = max - min

2. Inter-Quartile Range (IQR)
Inter-Quartile Range = IQR = Q3 - Q1

Example
The data Verbal IQ on n = 23 students arranged in increasing order is:
80 82 84 86 86 89 90 94 94 95 95 96 99 99 102 102 104 105 105 109 111 118 119
min = 80, Q1 = 89, Q2 = 96, Q3 = 105, max = 119

Range and IQR
Range = max - min = 119 - 80 = 39
Inter-Quartile Range = IQR = Q3 - Q1 = 105 - 89 = 16

3. Sample Variance
Let $x_1, x_2, x_3, \ldots, x_n$ denote a set of n numbers. Recall the mean of the n numbers is defined as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + x_3 + \cdots + x_{n-1} + x_n}{n}$$

The numbers

$$d_1 = x_1 - \bar{x}, \quad d_2 = x_2 - \bar{x}, \quad d_3 = x_3 - \bar{x}, \quad \ldots, \quad d_n = x_n - \bar{x}$$

are called deviations from the mean.

The sum

$$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (x_i - \bar{x})^2$$

is called the sum of squares of deviations from the mean. Writing it out in full:

$$d_1^2 + d_2^2 + d_3^2 + \cdots + d_n^2$$

or

$$(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2$$

The Sample Variance
Is defined as the quantity:

$$s^2 = \frac{\sum_{i=1}^{n} d_i^2}{n-1} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$$

and is denoted by the symbol $s^2$.

The Sample Standard Deviation s
Definition: The Sample Standard Deviation is defined by:

$$s = \sqrt{\frac{\sum_{i=1}^{n} d_i^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$$

Hence the Sample Standard Deviation, s, is the square root of the sample variance.
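The Verbal IQ example can be checked with a short Python sketch. It is only an illustration: the helper names (`median`, `quartiles`, `sample_variance`) are hypothetical, and the quartile convention (median of the lower and upper halves, leaving out the middle observation when n is odd) is an assumption chosen because it reproduces Q1 = 89 and Q3 = 105 for these data. Statistical software often uses other quartile conventions and may give slightly different values.

```python
import math

# Verbal IQ scores for the n = 23 students in the example above.
verbal_iq = [80, 82, 84, 86, 86, 89, 90, 94, 94, 95, 95, 96,
             99, 99, 102, 102, 104, 105, 105, 109, 111, 118, 119]


def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves convention.

    Assumption: when n is odd the middle observation belongs to
    neither half.
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


q1, q2, q3 = quartiles(verbal_iq)
data_range = max(verbal_iq) - min(verbal_iq)
iqr = q3 - q1
s2 = sample_variance(verbal_iq)
s = math.sqrt(s2)

print("Q1, Q2, Q3 =", q1, q2, q3)   # 89 96 105
print("Range =", data_range)        # 39
print("IQR =", iqr)                 # 16
print(f"s^2 = {s2:.3f}, s = {s:.3f}")
```

Dividing by n - 1 rather than n is what distinguishes the sample variance defined above from the plain average of the squared deviations.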
Interpretations of s
• In Normal distributions
  – Approximately 2/3 of the observations will lie within one standard deviation of the mean
  – Approximately 95% of the observations lie within two standard deviations of the mean
  – In a histogram of the Normal distribution, the standard deviation is approximately the distance from the mode to the inflection point

[Figure: Normal curve with the labels Mode, Inflection point, s, 2/3 and 2s; horizontal axis 0 to 25.]

A Computing formula for sample variance
Sum of squares of deviations from the mean:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The difficulty with this formula is that $\bar{x}$ will have many decimals. The result will be that each term in the above sum will also have many decimals.

The sum of squares of deviations from the mean can also be computed using the following identity:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}$$

Then:

$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} = \frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}$$

and

$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n-1}}$$

A quick (rough) calculation of s

$$s \approx \frac{\text{Range}}{4}$$

The reason for this is that approximately all (95%) of the observations are between $\bar{x} - 2s$ and $\bar{x} + 2s$. Thus $\max \approx \bar{x} + 2s$ and $\min \approx \bar{x} - 2s$, and

$$\text{Range} = \max - \min \approx (\bar{x} + 2s) - (\bar{x} - 2s) = 4s$$

Hence $s \approx \text{Range}/4$.

The Pseudo Standard Deviation (PSD)
Definition: The Pseudo Standard Deviation (PSD) is defined by:

$$\text{PSD} = \frac{\text{IQR}}{1.35} = \frac{\text{InterQuartile Range}}{1.35}$$

Properties
• For Normal distributions the magnitude of the pseudo standard deviation (PSD) and the standard deviation (s) will be approximately the same value
• For leptokurtic distributions the standard deviation (s) will be larger than the pseudo standard deviation (PSD)
• For platykurtic distributions the standard deviation (s) will be smaller than the pseudo standard deviation (PSD)

Measures of Shape
• Skewness
[Figure: three example density curves illustrating skewness.]
• Kurtosis
[Figure: three example density curves illustrating kurtosis.]

• Skewness – based on the sum of cubes

$$\sum_{i=1}^{n} (x_i - \bar{x})^3$$

• Kurtosis – based on the sum of 4th powers

$$\sum_{i=1}^{n} (x_i - \bar{x})^4$$

The Measure of Skewness

$$g_1 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^3}{s^3}$$

The Measure of Kurtosis

$$g_2 = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^4}{s^4} - 3$$

Interpretations of Measures of Shape
• Skewness
  – g1 > 0: skewed to the right (long right tail)
  – g1 = 0: symmetric
  – g1 < 0: skewed to the left (long left tail)
• Kurtosis
  – g2 < 0: platykurtic (flatter than the Normal)
  – g2 = 0: Normal
  – g2 > 0: leptokurtic (more peaked than the Normal, with heavier tails)
[Figure: example density curves labelled g1 > 0, g1 = 0, g1 < 0 and g2 < 0, g2 = 0, g2 > 0.]

Advanced Box Plots
• An outlier is a "wild" observation in the data
• Outliers occur because
  – of errors (typographical and computational)
  – of extreme cases in the population
• We will now consider the drawing of box plots where outliers are identified

To Draw a Box Plot we need to:
• Compute the Hinge (Median, Q2) and the Mid-hinges (first and third quartiles, Q1 and Q3)
• To identify outliers we will compute the inner and outer fences

The fences are like the fences at a prison. We expect the entire population to be within both sets of fences.
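The fence rule can be sketched in a few lines of Python. This is only an illustration under the definitions used in these notes (inner fences 1.5 IQR beyond the quartiles, outer fences 3 IQR beyond them). The function names `fences` and `classify` are hypothetical, the quartiles are those of the Verbal IQ example (Q1 = 89, Q3 = 105), and the values 135 and 160 are made-up observations added only to show the mild and extreme cases. Treating an observation that falls exactly on a fence as not outside it is also an assumption.

```python
def fences(q1, q3):
    """Return ((lower inner, upper inner), (lower outer, upper outer))."""
    iqr = q3 - q1
    inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)
    outer = (q1 - 3.0 * iqr, q3 + 3.0 * iqr)
    return inner, outer


def classify(x, q1, q3):
    """Label x as a non-outlier, mild outlier or extreme outlier.

    Assumption: a value exactly on a fence counts as inside it.
    """
    (f1, f2), (F1, F2) = fences(q1, q3)
    if x < F1 or x > F2:
        return "extreme outlier"
    if x < f1 or x > f2:
        return "mild outlier"
    return "non-outlier"


# Quartiles from the Verbal IQ example.
q1, q3 = 89, 105
inner, outer = fences(q1, q3)
print("inner fences:", inner)   # (65.0, 129.0)
print("outer fences:", outer)   # (41.0, 153.0)

# 80 and 119 are the observed min and max; 135 and 160 are hypothetical.
labels = {x: classify(x, q1, q3) for x in (80, 119, 135, 160)}
for x, label in labels.items():
    print(x, "->", label)
```

With these quartiles every observed Verbal IQ score lies inside the inner fences, so a box plot of those data would show no individually plotted points.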
If a member of the population is between the inner and outer fences it is a mild outlier. If a member of the population is outside of the outer fences it is an extreme outlier.

Inner fences
Lower inner fence f1 = Q1 - (1.5)IQR
Upper inner fence f2 = Q3 + (1.5)IQR

Outer fences
Lower outer fence F1 = Q1 - (3)IQR
Upper outer fence F2 = Q3 + (3)IQR

• Observations that are between the lower and upper inner fences are considered to be non-outliers.
• Observations that are outside the inner fences but not outside the outer fences are considered to be mild outliers.
• Observations that are outside the outer fences are considered to be extreme outliers.

• Mild outliers are plotted individually in a box plot using one symbol
• Extreme outliers are plotted individually in a box plot using a different symbol
• Non-outliers are represented with the box and whiskers with
  – Max = largest observation within the fences
  – Min = smallest observation within the fences

[Figure: box-whisker plot representing the data that are not outliers, with the inner fences, an outer fence, mild outliers and an extreme outlier marked.]

Example
Data collected on n = 109 countries in 1995. Data collected on k = 25 variables.

The variables
1. Population Size (in 1000s)
2. Density = number of people per square kilometer
3. Urban = percentage of population living in cities
4. Religion
5. lifeexpf = average female life expectancy
6. lifeexpm = average male life expectancy
7. literacy = % of population who can read
8. pop_inc = % increase in population size (1995)
9. babymort = infant mortality (deaths per 1000)
10. gdp_cap = gross domestic product per capita
11. Region = region or economic group
12. calories = daily calorie intake
13. aids = number of AIDS cases
14. birth_rt = birth rate per 1000 people
15. death_rt = death rate per 1000 people
16. aids_rt = number of AIDS cases per 100000 people
17. log_gdp = log10(gdp_cap)
18. log_aidsr = log10(aids_rt)
19. b_to_d = birth to death ratio
20. fertility = average number of children in family
21. log_pop = log10(population)
22. cropgrow = ??
23. lit_male = % of males who can read
24. lit_fema = % of females who can read
25. Climate = predominant climate

[Figure: the data file as it appears in SPSS.]

Consider the data on infant mortality

Stem-Leaf diagram (stem = 10s, leaf = unit digit)

 0 | 4455555666666666777778888899
 1 | 0122223467799
 2 | 0001123555577788
 3 | 45567999
 4 | 135679
 5 | 011222347
 6 | 03678
 7 | 4556679
 8 | 5
 9 | 4
10 | 1569
11 | 0022378
12 | 46
13 | 7
14 |
15 |
16 | 8

Summary Statistics
median = Q2 = 27

Quartiles
Lower quartile = Q1 = the median of the lower half
Upper quartile = Q3 = the median of the upper half

$$Q_1 = \frac{12 + 12}{2} = 12, \qquad Q_3 = \frac{66 + 67}{2} = 66.5$$

Interquartile range (IQR)
IQR = Q3 - Q1 = 66.5 - 12 = 54.5

The Outer Fences
lower = Q1 - 3(IQR) = 12 - 3(54.5) = -151.5
upper = Q3 + 3(IQR) = 66.5 + 3(54.5) = 230.0
No observations are outside of the outer fences.

The Inner Fences
lower = Q1 - 1.5(IQR) = 12 - 1.5(54.5) = -69.75
upper = Q3 + 1.5(IQR) = 66.5 + 1.5(54.5) = 148.25
Only one observation (168, Afghanistan) is outside of the inner fences, so it is a mild outlier.

[Figure: Box-Whisker Plot of Infant Mortality; horizontal axis: Infant Mortality, 0 to 200.]

Example 2
In this example we are looking at the weight gains (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork).
  – Ten test animals for each diet

Table: Gains in weight (grams) for rats under six diets differing in level of protein (High or Low) and source of protein (Beef, Cereal, or Pork)

Level        High Protein              Low Protein
Source       Beef    Cereal  Pork      Beef    Cereal  Pork
Diet         1       2       3         4       5       6

             73      98      94        90      107     49
             102     74      79        76      95      82
             118     56      96        90      97      73
             104     111     98        64      80      86
             81      95      102       86      98      81
             107     88      102       51      74      97
             100     82      108       72      74      106
             87      77      91        90      67      70
             117     86      120       95      89      61
             111     92      105       78      58      82

Median       103.0   87.0    100.0     82.0    84.5    81.5
Mean         100.0   85.9    99.5      79.2    83.9    78.7
IQR          24.0    18.0    11.0      18.0    23.0    16.0
PSD          17.78   13.33   8.15      13.33   17.04   11.85
Variance     229.11  225.66  119.17    192.84  246.77  273.79
Std. Dev.    15.14   15.02   10.92     13.89   15.71   16.55
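The summary rows of the diet table can be recomputed from the raw gains. The sketch below is one way to do it, not the procedure used to prepare the table: the names `diets`, `quartiles` and `summarize` are hypothetical, and quartiles are taken as the medians of the lower and upper halves of the sorted data (five values each for n = 10), which is an assumption about the convention used.

```python
import math

# Weight gains (grams) for the ten rats on each diet, keyed by diet number.
diets = {
    1: ("High", "Beef",   [73, 102, 118, 104, 81, 107, 100, 87, 117, 111]),
    2: ("High", "Cereal", [98, 74, 56, 111, 95, 88, 82, 77, 86, 92]),
    3: ("High", "Pork",   [94, 79, 96, 98, 102, 102, 108, 91, 120, 105]),
    4: ("Low",  "Beef",   [90, 76, 90, 64, 86, 51, 72, 90, 95, 78]),
    5: ("Low",  "Cereal", [107, 95, 97, 80, 98, 74, 74, 67, 89, 58]),
    6: ("Low",  "Pork",   [49, 82, 73, 86, 81, 97, 106, 70, 61, 82]),
}


def median(sorted_values):
    """Median of a list that is already sorted."""
    n = len(sorted_values)
    mid = n // 2
    if n % 2 == 1:
        return sorted_values[mid]
    return (sorted_values[mid - 1] + sorted_values[mid]) / 2


def quartiles(values):
    """(Q1, Q2, Q3) as medians of the lower half, all data, upper half."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


def summarize(values):
    """Median, mean, IQR, PSD = IQR / 1.35, sample variance and s."""
    n = len(values)
    mean = sum(values) / n
    q1, q2, q3 = quartiles(values)
    iqr = q3 - q1
    variance = sum((x - mean) ** 2 for x in values) / (n - 1)
    return {
        "median": q2,
        "mean": mean,
        "iqr": iqr,
        "psd": iqr / 1.35,
        "variance": variance,
        "sd": math.sqrt(variance),
    }


summary = {diet: summarize(gains) for diet, (_, _, gains) in diets.items()}

for diet, (level, source, _) in diets.items():
    stats = summary[diet]
    row = "  ".join(f"{name}={value:.2f}" for name, value in stats.items())
    print(f"Diet {diet} ({level} protein, {source}): {row}")
```

Comparing the `psd` and `sd` entries diet by diet gives a rough feel for how far each small sample departs from the Normal-distribution case, where the two are expected to be close.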
Box Plots: Weight Gains for Six Diets
[Figure: box plots of Weight Gain (vertical axis, 40 to 130) by Diet (1 to 6), grouped as High Protein (Beef, Cereal, Pork) and Low Protein (Beef, Cereal, Pork). Legend: Non-Outlier Max, Non-Outlier Min; Median; 75%, 25%.]

Conclusions
• Weight gain is higher for the high protein meat diets
• Increasing the level of protein increases weight gain, but only if the source of protein is a meat source

Multivariate Data