Types of variables

There are several types of variables, reflecting different characteristics of the data: ratio, interval, ordinal and nominal.

Ratio scale
- constant-size interval between adjacent values on the measurement scale
- a meaningful zero point exists

Interval scale
- constant-size interval between adjacent values on the measurement scale
- no true zero value
[Figure: compass points (N, W, E, S) and a scale marked 10, 0, -10]

Ordinal scale
- data that convey only relative magnitude
- examples: Dark / Medium / Light; Tall / Medium / Short

Nominal scale
- data in which there is no meaningful numerical information
- example: Single / Married / Divorced / Widowed

Another useful classification

Continuous data can take on any value.
  E.g. height, range 150 to 210 cm: Bill is 174.25 cm.
Discrete data can take on only certain values.
  E.g. number of hands, range 0 to 3: Bill has 2 hands.

Two more important issues with data

Accuracy: how close a measured value is to the real value.
Precision: how close repeated measurements are to one another.

Let's say Bill's real height is 174.25 cm. Repeated measurements (cm) might look like this:

Accurate and precise:        174.25  174.25  174.25  174.25  174.25  174.25
Not accurate, not precise:   172     178     171     174     182     168
Not accurate but precise:    170.25  170.25  170.25  170.25  170.25  170.25

Frequency distribution

The occurrence of the various values observed for the variable.
- raw frequency: counts
- relative frequency: counts divided by the total number of observations

Name      Height (cm)   Hair colour
Anne      168           Brown
Rishi     178           Black
Bill      183           Brown
Cristin   172           Brown
Rich      175           Black

Variable: hair colour (sample size = 5)
  Frequency of black hair = 2
  Frequency of brown hair = 3
  (must add to 5)
  Relative frequency of black hair = 2/5 = 0.4
  Relative frequency of brown hair = 3/5 = 0.6
  (must add to 1)

Variable: height (sample size = 5)
  Each of the five heights (168, 172, 175, 178 and 183 cm) occurs once, so each has frequency 1 and relative frequency 1/5 = 0.2.

Because every height is unique, it is more useful to make categories, e.g. the number of observations above and below the mid-point of the range.

Range: maximum - minimum = 183 cm - 168 cm = 15 cm
Mid-point: halfway between minimum and maximum = minimum + (range / 2) = 168 cm + 7.5 cm = 175.5 cm

Frequency of heights below 175.5 cm = 3
Frequency of heights above 175.5 cm = 2
Relative frequency of heights below 175.5 cm = 3/5 = 0.6
Relative frequency of heights above 175.5 cm = 2/5 = 0.4

We could also make three categories. Divide the range by 3: 15 cm / 3 = 5 cm.

Category   Range (cm)                        Frequency   Relative frequency
1          168 to 168 + 5, i.e. 168 to 173   2           2/5 = 0.4
2          174 to 174 + 5, i.e. 174 to 179   2           2/5 = 0.4
3          180 to 180 + 5, i.e. 180 to 185   1           1/5 = 0.2

Mother's age and baby's birth weight: data from Massachusetts

Each age (years) is paired with the birth weight (g) in the same position of the row beneath it.

Observations 1-33
Age (years): 19 33 20 21 18 21 22 17 29 26 19 19 22 30 18 18 15 25 20 28 32 31 36 28 25 28 17 29 26 17 17 24 35
Weight (g):  2523 2551 2557 2594 2600 2622 2637 2637 2663 2665 2722 2733 2750 2750 2769 2769 2778 2782 2807 2821 2835 2835 2836 2863 2877 2877 2906 2920 2920 2920 2920 2948 2948

Observations 34-66
Age (years): 25 25 29 19 27 31 33 21 19 23 21 18 18 32 19 24 22 22 23 22 30 19 16 21 30 20 17 17 23 24 28 26 20
Weight (g):  2977 2977 2977 2977 2992 3005 3033 3042 3062 3062 3062 3076 3076 3080 3090 3090 3090 3100 3104 3132 3147 3175 3175 3203 3203 3203 3225 3225 3232 3232 3234 3260 3274

Observations 67-99
Age (years): 24 28 20 22 22 31 23 16 16 18 25 32 20 23 22 32 30 20 23 17 19 23 36 22 24 21 19 25 16 29 29 19 19
Weight (g):  3274 3303 3317 3317 3317 3321 3331 3374 3374 3402 3416 3430 3444 3459 3460 3473 3475 3487 3544 3572 3572 3586 3600 3614 3614 3629 3629 3637 3643 3651 3651 3651 3651

Observations 100-132
Age (years): 30 24 19 24 23 20 25 30 22 18 16 32 18 29 33 20 28 14 28 25 16 20 26 21 22 25 31 35 19 24 45 28 29
Weight (g):  3699 3728 3756 3770 3770 3770 3790 3799 3827 3856 3860 3860 3884 3884 3912 3940 3941 3941 3969 3983 3997 3997 4054 4054 4111 4153 4167 4174 4238 4593 4990 709 1021

Observations 133-165
Age (years): 34 25 25 27 23 24 24 21 32 19 25 16 25 20 21 24 21 20 25 19 19 26 24 17 20 22 27 20 17 25 20 18 18
Weight (g):  1135 1330 1474 1588 1588 1701 1729 1790 1818 1885 1893 1899 1928 1928 1928 1936 1970 2055 2055 2082 2084 2084 2100 2125 2126 2187 2187 2211 2225 2240 2240 2282 2296

Observations 166-189
Age (years): 20 21 26 31 15 23 20 24 15 23 30 22 17 23 17 26 20 26 14 28 14 23 17 21
Weight (g):  2296 2301 2325 2353 2353 2367 2381 2381 2381 2395 2410 2410 2414 2424 2438 2442 2450 2466 2466 2466 2495 2495 2495 2495

Range of the birth weight data
  Minimum: 709 g
  Maximum: 4990 g
  Difference: 4281 g

Let's say we want to look at the distribution of the data across 10 categories. Each category would span 428.1 g, but for convenience we'll round to 430 g. Also, instead of starting the first category at 709 g, we'll start at 700 g.

Category   Range (g)    Freq.   Rel. freq.
1          700-1130     3       0.0159
2          1131-1560    3       0.0159
3          1561-1990    14      0.0741
4          1991-2420    29      0.1534
5          2421-2850    34      0.1799
6          2851-3280    44      0.2328
7          3281-3710    33      0.1746
8          3711-4140    23      0.1217
9          4141-4570    4       0.0212
10         4571-5000    2       0.0106

This breakdown is fine as long as weight has been measured to the nearest gram.
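The counting described above can be sketched in a few lines of Python. This is only an illustration, not part of the original slides: the function names `frequency_table` and `binned_frequencies` are hypothetical, and the classes are written as half-open intervals (lower limit included, upper limit excluded), which is an assumption made here so that the integer classes 168-173, 174-179 and 180-185 cm become bins of width 6.

```python
from collections import Counter


def frequency_table(values):
    """Return {value: (frequency, relative frequency)} for a list of observations."""
    n = len(values)
    return {value: (count, count / n) for value, count in Counter(values).items()}


def binned_frequencies(values, start, width, n_bins):
    """Count observations in n_bins half-open classes
    [start + k*width, start + (k+1)*width), for k = 0 .. n_bins-1."""
    counts = [0] * n_bins
    for value in values:
        k = int((value - start) // width)
        if 0 <= k < n_bins:  # observations outside all classes are ignored
            counts[k] += 1
    return counts


# Hair colour and height data from the five-person example
hair = ["Brown", "Black", "Brown", "Brown", "Black"]
heights = [168, 178, 183, 172, 175]

hair_table = frequency_table(hair)
print(hair_table)

# Classes 168-173, 174-179 and 180-185 cm, written as half-open bins of width 6
counts = binned_frequencies(heights, start=168, width=6, n_bins=3)
relative = [c / len(heights) for c in counts]
print(counts, relative)
```

Half-open classes leave no gaps between adjacent categories, so the same function works whatever the resolution of the measurements.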
But if weight has been measured to the nearest 0.1 g, these categories may miss some observations (for example, a weight of 1130.5 g falls between categories 1 and 2). So we need to adjust the upper limits:

Category   Measured to the nearest gram   Measured to the nearest 0.1 gram
1          700-1130                       700-1130.9
2          1131-1560                      1131-1560.9
3          1561-1990                      1561-1990.9
4          1991-2420                      1991-2420.9
5          2421-2850                      2421-2850.9
6          2851-3280                      2851-3280.9
7          3281-3710                      3281-3710.9
8          3711-4140                      3711-4140.9
9          4141-4570                      4141-4570.9
10         4571-5000                      4571-5000.9

Histogram: a graphical representation of a frequency distribution

[Figure: histogram of hair colour (brown hair, black hair); y-axis: frequency]
[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: frequency]
[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: relative frequency]

The categories can also be labelled by their mid-points:

Category   Range (g)    Mid-point (g)
1          700-1130     915
2          1131-1560    1346
3          1561-1990    1776
4          1991-2420    2206
5          2421-2850    2636
6          2851-3280    3066
7          3281-3710    3496
8          3711-4140    3926
9          4141-4570    4356
10         4571-5000    4786

[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category mid-point; y-axis: frequency]

Cumulative frequency

The cumulative frequency at any category is equal to the frequency at that category plus the frequencies of all previous categories.

Category   Range (g)    Freq.   Rel. freq.   Cum. rel. freq.
1          700-1130     3       0.0159       0.0159
2          1131-1560    3       0.0159       0.0317
3          1561-1990    14      0.0741       0.1058
4          1991-2420    29      0.1534       0.2593
5          2421-2850    34      0.1799       0.4392
6          2851-3280    44      0.2328       0.6720
7          3281-3710    33      0.1746       0.8466
8          3711-4140    23      0.1217       0.9683
9          4141-4570    4       0.0212       0.9894
10         4571-5000    2       0.0106       1.0000

[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: cumulative relative frequency]

Measures of Central Tendency

These generally tell you where the majority of the observations lie. Each one tells you something slightly different.

Mean: the average
Median: the middle value
Mode: the most frequent value

The Mean

The mean is calculated by summing the observed values and dividing the sum by the total number of observations.

Population mean = $\mu$
Sample mean = $\bar{X}$

A die has 6 sides, with 1, 2, 3, 4, 5 and 6 dots. The population mean is

$$\mu = \frac{1 + 2 + 3 + 4 + 5 + 6}{6} = 3.5 \text{ dots}$$

and for a sample of three values (2, 3 and 4) the sample mean is

$$\bar{X} = \frac{2 + 3 + 4}{3} = 3 \text{ dots}$$

In general:

$$\mu = \frac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \frac{\sum_{i=1}^{N} X_i}{N}$$

$$\bar{X} = \frac{X_1 + X_2 + X_3 + \dots + X_n}{n} = \frac{\sum_{i=1}^{n} X_i}{n}$$

Name      Observation i   Height X_i
Rishi     1               172
Anne      2               185
Bill      3               132
Cristin   4               191
Rich      5               205

n = 5, $\sum X_i = 885$

$$\bar{X} = \frac{\sum X_i}{n} = \frac{885}{5} = 177$$

For the 189 birth weights in the Massachusetts data set listed earlier:

n = 189, $\sum_{i=1}^{189} X_i = 556540$

$$\bar{X} = \frac{\sum X_i}{n} = \frac{556540}{189} = 2944.656 \text{ g}$$

Another way to calculate the mean

Suppose you had a frequency distribution for the number of cancerous moles on people who regularly visit Club Med.

No. of cancerous moles (x)   Frequency (f)   f*x
0                            8               0
1                            4               4
2                            8               16
3                            10              30
4                            2               8
5                            1               5
Sum                          33              63

Here $n = \sum f = 33$ and $\sum X_i = \sum f x = 63$, so

$$\bar{X} = \frac{\sum f x}{\sum f} = \frac{63}{33} = 1.909$$

The Mode

The mode is the most frequently occurring value in a set of measurements.

[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: frequency]

In the birth weight frequency table, the category with the highest frequency is category 6 (2851-3280 g, frequency 44, relative frequency 0.2328). Its mid-point is 3065.5 g, so report the mode as 3065.5 g.

The Median

The median is the middle measurement of a set of data, so the data must first be ordered.

Observation   Height (cm)   Ordered height (cm)
1             178           123
2             143           143
3             123           168
4             189           173
5             187           178
6             205           187
7             168           189
8             173           198
9             198           205

The median is 178 cm.

Observation   Height (cm)   Ordered height (cm)
1             178           123
2             143           143
3             123           162
4             189           168
5             187           173
6             205           178
7             168           187
8             173           189
9             198           198
10            162           205

The middle observation is number 5.5, so the median is midway between ordered observations 5 and 6: (173 + 178) / 2 = 175.5 cm.

General formula for the median

If n is an odd number:

$$M = X_{(n+1)/2} = X_{(9+1)/2} = X_{5} = 178$$

If n is an even number:

$$M = X_{(n+1)/2} = X_{(10+1)/2} = X_{5.5} = \frac{X_5 + X_6}{2} = \frac{173 + 178}{2} = 175.5$$

Median from a frequency table:

No. of cancerous moles (X)   Frequency (f)   Cumulative frequency
0                            8               8
1                            4               12
2                            8               20
3                            10              30
4                            2               32
5                            1               33

Ordered observations: 0 0 0 0 0 0 0 0 1 1 1 1 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3 4 4 5

$$M = X_{(n+1)/2} = X_{(33+1)/2} = X_{17} = 2$$

Median from grouped data:

Category   Range (g)    Freq.   Cum. freq.
1          700-1130     3       3
2          1131-1560    3       6
3          1561-1990    14      20
4          1991-2420    29      49
5          2421-2850    34      83
6          2851-3280    44      127
7          3281-3710    33      160
8          3711-4140    23      183
9          4141-4570    4       187
10         4571-5000    2       189

$$M = X_{(n+1)/2} = X_{190/2} = X_{95}$$

The 95th observation falls in category 6 (2851-3280 g), so

$$\text{Median} = (\text{lower limit of class}) + \frac{0.5\,n - (\text{cum. freq. of the previous class})}{\text{no. of observations in the class}} \times (\text{class width})$$

$$= 2851 + \frac{0.5 \times 189 - 83}{44} \times 430 = 2851 + \frac{94.5 - 83}{44} \times 430 = 2963.4 \text{ g}$$

[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: frequency]
[Figure: Symmetrical, unimodal distribution; mean, mode and median marked together]
[Figure: Symmetrical, bimodal distribution; labels: mean, median, mode, mode]
[Figure: Asymmetric distribution; labels in order: mode, median, mean]
[Figure: Asymmetric distribution; labels in order: mean, median, mode]

Measures of Dispersion and Variability

[Figure: Frequency distribution of neonatal birth weight; x-axis: birth weight category (1-10); y-axis: frequency]
[Figure: plot on a 0-5500 axis with the maximum, mean and minimum marked]
[Figure: the same plot with the deviation of observation i from the mean marked]

Average Deviation from the Mean: on average, how much do the individual observations differ from the mean?
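A quick numerical check can help with this question. The sketch below is an illustration only (the function names are hypothetical) and uses the seven observations from this section's worked example, 1.2 to 2.4 in steps of 0.2. It suggests why the signed average deviation is not a useful measure: the positive and negative deviations cancel, whereas the absolute deviations do not.

```python
def mean(xs):
    """Arithmetic mean of a list of numbers."""
    return sum(xs) / len(xs)


def average_deviation(xs):
    """Mean of the signed deviations from the mean (zero up to rounding error)."""
    m = mean(xs)
    return sum(x - m for x in xs) / len(xs)


def average_absolute_deviation(xs):
    """Mean of the absolute deviations from the mean."""
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


xs = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]

signed = average_deviation(xs)
absolute = average_absolute_deviation(xs)
print(f"mean = {mean(xs):.2f}")
print(f"average deviation = {signed:.2f}")
print(f"average absolute deviation = {absolute:.2f}")
```

Because of floating-point rounding, the signed result may print as a tiny non-zero number if more decimal places are requested.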
$$\text{Average deviation} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})}{n}$$

Worked example (n = 7):

i     X_i    X_i - X̄            |X_i - X̄|          (X_i - X̄)^2
1     1.2    1.2 - 1.8 = -0.6   |1.2 - 1.8| = 0.6   (-0.6)^2 = 0.36
2     1.4    -0.4               0.4                 0.16
3     1.6    -0.2               0.2                 0.04
4     1.8    0.0                0.0                 0.00
5     2.0    0.2                0.2                 0.04
6     2.2    0.4                0.4                 0.16
7     2.4    0.6                0.6                 0.36
Sum   12.6   0.0                2.4                 1.12

$$\bar{X} = \frac{12.6}{7} = 1.8$$

$$\frac{\sum_{i=1}^{7} (X_i - \bar{X})}{7} = \frac{0}{7} = 0$$

The positive and negative deviations cancel, so the average deviation is always zero.

Average Absolute Deviation from the Mean: on average, how much do the individual observations differ from the mean?

$$\frac{\sum_{i=1}^{n} |X_i - \bar{X}|}{n} = \frac{2.4}{7} = 0.34$$

Sum of Squared Deviations ("Sum of Squares")

$$SS = \sum_{i=1}^{n} (X_i - \bar{X})^2 = 1.12$$

Variance: the mean sum of squares

Population:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

Sample:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1} = \frac{1.12}{6} = 0.1867$$

Standard Deviation

Population: $\sigma = \sqrt{\sigma^2}$
Sample: $s = \sqrt{s^2}$

Coefficient of Variation

$$V = \frac{s}{\bar{X}}$$

This is s expressed as a percentage of the mean. It allows comparison of variability among samples measured in different units or scales.

[Figure: bar chart comparing the dispersion measures for two data sets; the first row matches the worked example above]

             Mean deviation   Variance   Standard deviation   CV
Data set 1   0.34             0.1867     0.43                 0.24
Data set 2   0.26             0.1367     0.37                 0.21

Standard Error of the Mean

Recall: $\bar{X}$ and s are estimates of $\mu$ and $\sigma$. How good are these estimates? We need the level of uncertainty (due to sampling error) in the mean:

$$SE_{\bar{X}} = \frac{s}{\sqrt{n}}$$

Confidence Intervals

SE is a measure of how far $\bar{X}$ is likely to be from $\mu$.
2 * SE corresponds to roughly 95% confidence, i.e. $\mu$ lies within $\bar{X} \pm 2 \cdot SE$ about 95% of the time.

Reporting variability about the mean

In text: in a table, as in the previous slide. Or, for example, in a manuscript, I might write:

The mean (± 95% CI) for the random samples of 100, 50, 25 and 10 was 24.84079 (±0.1816), 24.91241 (±0.31996), 24.86719 (±0.40142) and 25.16212 (±0.859), respectively.

You are not restricted to using confidence intervals when reporting variability about the mean, i.e. I could have used mean ± standard deviation, or mean ± standard error.

Graphically: Box Plot or Box and Whisker Plot

[Figure: box plot of neonate weight (g) by type of mother (non-smokers, smokers), showing the mean, standard error and 95% CI]
[Figure: the same box plot showing only the mean and 95% CI]
[Figure: the same box plot (mean and 95% CI) with the neonate weight axis running from 0 to 4000 g]
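The dispersion measures defined earlier (sum of squares, sample variance, standard deviation, coefficient of variation, standard error and the approximate 95% confidence interval) can be checked with a short script. This is a minimal sketch, not part of the original slides: the function name `describe` and its dictionary keys are hypothetical, the data are the seven observations from the worked example, and the interval uses the slides' rule of thumb of mean ± 2 * SE rather than an exact t-based multiplier.

```python
import math


def describe(xs):
    """Return basic dispersion measures for a sample (n - 1 in the variance)."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)  # sum of squared deviations
    variance = ss / (n - 1)                # sample variance
    sd = math.sqrt(variance)               # sample standard deviation
    cv = sd / mean                         # coefficient of variation (as a fraction)
    se = sd / math.sqrt(n)                 # standard error of the mean
    ci = (mean - 2 * se, mean + 2 * se)    # rule-of-thumb 95% interval
    return {"n": n, "mean": mean, "ss": ss, "variance": variance,
            "sd": sd, "cv": cv, "se": se, "ci": ci}


xs = [1.2, 1.4, 1.6, 1.8, 2.0, 2.2, 2.4]
stats = describe(xs)

print(f"SS = {stats['ss']:.2f}")
print(f"s^2 = {stats['variance']:.4f}")
print(f"s = {stats['sd']:.2f}")
print(f"CV = {stats['cv']:.2f}")
print(f"SE = {stats['se']:.3f}")
print(f"mean ± 2*SE = {stats['mean']:.2f} ± {2 * stats['se']:.2f}")
```

With a sample as small as n = 7, the factor of 2 understates the width of a true 95% interval; a t-based multiplier would be somewhat larger.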