Welcome to Week 04
College Statistics
Image: http://media.dcnews.ro/image/201109/w670/statistics.jpg

Descriptive Statistics

Averages tell where the data tends to pile up. Another good way to describe data is how spread out it is.

Suppose you are using the mean "5" to describe each of the observations in your sample.

VARIABILITY IN-CLASS PROBLEMS

For which sample would "5" be closer to the actual data values? In other words, for which of the two sets of data would the mean be a better descriptor?

Variability

Numbers that tell how spread out our data values are are called "Measures of Variability". The variability tells how close to the "average" the sample data tend to be.

Just like measures of central tendency, there are several measures of variability:

Range = max - min

Interquartile range (symbolized IQR): IQR = 3rd quartile - 1st quartile

"Range Rule of Thumb": a quick-and-dirty variability measure (a rough estimate of the standard deviation): (max - min)/4

Variance (symbolized $s^2$):
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

An observation $x$ minus the mean $\bar{x}$ is called a "deviation". The variance is sort of an average (arithmetic mean) of the squared deviations.

Sums of squared deviations are also used in the formula for a circle:
$r^2 = (x - h)^2 + (y - k)^2$
where r is the radius of the circle and (h, k) is its center.

OK… so if it's sort of an arithmetic mean, how come it is divided by "n-1" and not "n"?
Every time we estimate something in the population using our sample, we use up a bit of the "luck" that we had in getting a (hopefully) representative sample. To make up for that, we give a little edge to the opposing side of the story.

A small variability suggests that our sample arithmetic mean is a better estimate of the population mean than it would be with a large variability, so we bump up our estimate of variability a tad to make up for it. Dividing by "n" would give us a smaller variance than dividing by "n-1", so we divide by "n-1".

Why not "n-2"? Because we have used only 1 estimate to calculate the variance: $\bar{x}$.

So, the variance is sort of an average (arithmetic mean) of the squared deviations, bumped up a tad to make up for using an estimate ($\bar{x}$) of the population mean (μ).

Trust me…

Standard deviation (symbolized "s" or "std"):
$s = \sqrt{s^2}$ (the square root of the variance)

The standard deviation is the square root of a (sort of) average of squared deviations. We've used the square root of a sum of squares before in distance calculations:
$d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$

The range, interquartile range and standard deviation are in the same units as the original data (a good thing). The variance is in squared units (which can be confusing…).

Naturally, the measure of variability used most often is the hard-to-calculate one… the standard deviation. Statisticians like it because it is a kind of average distance of all of the data from the center, the arithmetic mean.

Summary:

Range = max - min
IQR = 3rd quartile - 1st quartile
Range Rule of Thumb = (max - min)/4
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
Standard deviation: $s = \sqrt{s^2}$

Questions?
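For anyone who would rather see the formulas as code than as algebra, here is a minimal Python sketch of the range, the Range Rule of Thumb, the variance and the standard deviation. The function names and the data set `example` are made up for illustration and are not from the course. The sketch also divides by "n" once, just to show that it gives a smaller number than dividing by "n-1".

```python
import math


def sample_variance(data):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / (n - 1)


def variance_divided_by_n(data):
    """Same sum of squared deviations, but divided by n instead of n - 1."""
    n = len(data)
    mean = sum(data) / n
    return sum((x - mean) ** 2 for x in data) / n


def summarize(data):
    """Return the measures of variability from the summary list."""
    data_range = max(data) - min(data)
    s2 = sample_variance(data)
    return {
        "range": data_range,
        "thumb": data_range / 4,   # Range Rule of Thumb
        "variance": s2,
        "std": math.sqrt(s2),      # standard deviation
    }


# Hypothetical data, chosen only for illustration (not from the course).
example = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(example)
print(summary)
print("divided by n:", variance_divided_by_n(example))
```

The IQR is left out on purpose: quartile conventions differ between textbooks and software, so any single implementation here would be a guess about which convention the course uses.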
VARIABILITY IN-CLASS PROBLEMS

Range = max - min
IQR = 3rd quartile - 1st quartile
Thumb = (max - min)/4
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
$s = \sqrt{s^2}$

Data: 1 1 2 2 3 3

What is the range?
Min = 1, Max = 3
Range = 3 - 1 = 2

What is the IQR?
Q1 = 1, Median = 2, Q3 = 3
IQR = 3 - 1 = 2

What is the Thumb?
Min = 1, Max = 3
Thumb = (3 - 1)/4 = 0.5

What is the Variance?
First find $\bar{x}$!
$\bar{x} = \frac{3 + 3 + 2 + 2 + 1 + 1}{6} = 2$

Now calculate the deviations!

Data: 1        1        2       2       3       3
Dev:  1-2=-1   1-2=-1   2-2=0   2-2=0   3-2=1   3-2=1

What do you get if you add up all of the deviations?

Zero! That's true for ALL deviations from the mean, everywhere, in all times! That's why they are squared in the sum of squares!

Data:  1         1         2      2      3      3
Dev²:  (-1)²=1   (-1)²=1   0²=0   0²=0   1²=1   1²=1

$\sum (x - \bar{x})^2 = 1 + 1 + 0 + 0 + 1 + 1 = 4$

Variance: 4/(6-1) = 4/5 = 0.8

YAY!

What is s?
$s = \sqrt{0.8} \approx 0.89$

So, for Data: 1 1 2 2 3 3

Range = max - min = 2
IQR = 3rd quartile - 1st quartile = 2
Thumb = (max - min)/4 = 0.5
Variance: $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = 0.8$
$s = \sqrt{s^2} \approx 0.89$

Aren't you glad Excel does all this for you???

Questions?

Variability

Just like for n and N and $\bar{x}$ and μ, there are population variability symbols, too! Naturally, these are going to have funny Greek-y symbols just like the averages…

The population variance is "$\sigma^2$", called "sigma-squared". The population standard deviation is "σ", called "sigma".

Again, the sample statistics $s^2$ and s estimate the population parameters $\sigma^2$ and σ (which are unknown).

Some calculators can find $\bar{x}$, s and σ for you. (Not recommended for large data sets: use Excel.)

$s^2$ vs. $\sigma^2$: $s^2$ is divided by "n-1"; $\sigma^2$ is divided by "N" (the number of values in the population).

Questions?

Outliers! They can really affect your statistics!

OUTLIERS IN-CLASS PROBLEMS

Suppose we originally had data: 1 1 1 2 3 5
Suppose we now have data:       1 1 1 2 3 741

Is the mode affected?
Original mode: 1        New mode: 1

Is the midrange affected?
Original midrange: 3    New midrange: 371

Is the median affected?
Original median: 1.5    New median: 1.5

Is the mean affected?
Original mean: 2 1/6    New mean: 124 5/6

Outliers! How about measures of variability?

Is the range affected?
Original range: 4       New range: 740

Is the interquartile range affected?
Original IQR: 2.5 - 1 = 1.5    New IQR: 1.5

Is the variance affected?
Original $s^2$: ≈ 2.57  New $s^2$: ≈ 91,119.37

Is the standard deviation affected?
Original s: ≈ 1.60      New s: ≈ 301.86

Questions?
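The outlier comparison can also be checked outside of Excel. The sketch below uses Python's standard `statistics` module on the two data sets from the problems above. The helper name `describe` is made up for illustration. The IQR is skipped because quartile conventions differ between tools, so the module's quartiles might not match the values worked in class.

```python
import statistics

original = [1, 1, 1, 2, 3, 5]
with_outlier = [1, 1, 1, 2, 3, 741]


def describe(data):
    """Collect the averages and measures of variability used in class."""
    return {
        "mode": statistics.mode(data),
        "midrange": (min(data) + max(data)) / 2,
        "median": statistics.median(data),
        "mean": statistics.mean(data),
        "range": max(data) - min(data),
        "variance": statistics.variance(data),  # divides by n - 1
        "std": statistics.stdev(data),          # square root of the above
    }


before = describe(original)
after = describe(with_outlier)

# Print each statistic side by side and flag the ones the outlier moved.
for name in before:
    status = "changed" if before[name] != after[name] else "unchanged"
    print(f"{name:>8}: {before[name]:10.2f} -> {after[name]:12.2f}  ({status})")
```

Only the mode and the median come out unchanged; every statistic that uses the actual value of the largest observation gets dragged along by the 741.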
Descriptive Statistics

Last week we got this summary table from Excel:

                     Beans          Liquor        Butter        BEQ
Mean                 72,836.8       5,230.8       18,537.5      104,030.2
Standard Error       1,835.5        309.9         593.1         1,528.7
Median               72,539.0       5,020.0       18,011.3      104,617.2
Mode                 #N/A           #N/A          #N/A          #N/A
Standard Deviation   9,359.4        1,580.2       3,024.1       7,794.8
Sample Variance      87,599,301.8   2,496,988.9   9,145,138.6   60,759,154.8
Kurtosis             -1.2           -0.2          -1.3          -1.0
Skewness             0.0            0.1           0.3           -0.1
Range                32,359.4       6,477.2       9,384.7       27,075.8
Midrange             71,625.3       5,076.6       19,263.4      103,849.2
Minimum              55,445.6       1,838.0       14,571.0      90,311.3
Maximum              87,805.0       8,315.2       23,955.7      117,387.1
Sum                  1,893,757.1    136,000.0     481,975.2     2,704,784.1
Count                26.0           26.0          26.0          26.0

Which are Measures of Central Tendency?
Which are Measures of Variability?

Questions?

Variability

OK… swell… but WHAT DO YOU USE THESE MEASURES OF VARIABILITY FOR???
From last week: THE BEANS!

                    Moong-L  Moong-W  Moong-D  Black-L  Black-W  Black-D  Cran-L   Cran-W   Cran-D   Lima-L   Lima-W   Lima-D   Fava-L   Fava-W   Fava-D
Mean                4.77     3.38     3.00     8.23     5.54     4.15     12.85    7.85     5.92     20.77    13.08    6.54     27.92    17.77    8.00
Standard Deviation  0.44     0.65     0.71     1.01     0.78     0.90     1.21     0.69     0.86     1.01     1.12     1.66     1.75     1.36     2.42
Sample Variance     0.19     0.42     0.50     1.03     0.60     0.81     1.47     0.47     0.74     1.03     1.24     2.77     3.08     1.86     5.83
Range               1.00     2.00     2.00     3.00     3.00     2.00     4.00     2.00     3.00     4.00     4.00     7.00     5.00     5.00     10.00
Minimum             4.00     2.00     2.00     7.00     4.00     3.00     10.00    7.00     4.00     19.00    11.00    4.00     26.00    15.00    5.00
Maximum             5.00     4.00     4.00     10.00    7.00     5.00     14.00    9.00     7.00     23.00    15.00    11.00    31.00    20.00    15.00

We wanted to know: could you use sieves to separate the beans?

You could have plotted the mean measurement for each bean type. This might have helped you tell whether sieves could separate the types of beans.

But… beans are not all "average". Smaller beans might slip through the holes of the sieve! How could you tell if the beans were totally separable?

Make a graph that includes not just the average, but also the spread of the measurements!

New Excel Graph: hi-lo-close

Rearrange your data so that the labels are followed by the maximums, then the minimums, then the means:

                    Moong-L  Moong-W  Moong-D  Black-L  Black-W  Black-D  Cran-L   Cran-W   Cran-D   Lima-L   Lima-W   Lima-D   Fava-L   Fava-W   Fava-D
Maximum             5.00     4.00     4.00     10.00    7.00     5.00     14.00    9.00     7.00     23.00    15.00    11.00    31.00    20.00    15.00
Minimum             4.00     2.00     2.00     7.00     4.00     3.00     10.00    7.00     4.00     19.00    11.00    4.00     26.00    15.00    5.00
Mean                4.77     3.38     3.00     8.23     5.54     4.15     12.85    7.85     5.92     20.77    13.08    6.54     27.92    17.77    8.00

Highlight this data
Click "Insert"
Click "Other Charts"
Click the first Stock chart: "Hi-Lo-Close"

Ugly… as usual… but informative!
Left click the graph area
Click on "Layout"
Enter title and y-axis label
Click one of the "mean" markers on the graph
Click Format Data Series
Click Marker Options to adjust the markers
Repeat for the max (top of black vertical line) and min (bottom of black vertical line)

TAH DAH!

Which beans can you sieve?

Questions?

How to Lie with Statistics #4

You can probably guess… It involves using the type of measure of variability that serves your purpose best. This is almost always the smallest one.

Questions?
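Going back to the bean example: the hi-lo-close chart is really a picture of whether the min-to-max lines of two bean types overlap, and that can be checked directly. The Python sketch below uses the minimums and maximums from the bean table. The names `beans`, `ranges_overlap` and `separable_pairs` are made up for illustration. The rule that two bean types count as "separable" on a measurement only if their min-to-max ranges do not touch at all is an assumption of this sketch, not something the course states.

```python
# (minimum, maximum) of each measurement, copied from the bean table.
# L, W and D are the three measurement columns of that table.
beans = {
    "Moong": {"L": (4, 5), "W": (2, 4), "D": (2, 4)},
    "Black": {"L": (7, 10), "W": (4, 7), "D": (3, 5)},
    "Cran": {"L": (10, 14), "W": (7, 9), "D": (4, 7)},
    "Lima": {"L": (19, 23), "W": (11, 15), "D": (4, 11)},
    "Fava": {"L": (26, 31), "W": (15, 20), "D": (5, 15)},
}


def ranges_overlap(a, b):
    """True if the closed intervals a = (lo, hi) and b = (lo, hi) share a value."""
    return a[0] <= b[1] and b[0] <= a[1]


def separable_pairs(dimension):
    """List the pairs of bean types whose ranges do not overlap on one measurement.

    Assumption: "separable" means the two min-to-max ranges do not touch.
    """
    names = list(beans)
    pairs = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if not ranges_overlap(beans[first][dimension], beans[second][dimension]):
                pairs.append((first, second))
    return pairs


for dimension in ("L", "W", "D"):
    print(dimension, separable_pairs(dimension))
```

This looks only at the smallest and largest bean measured in each sample. Whether a real sieve would separate the beans also depends on which measurement decides if a bean fits through the hole, which the sketch does not model.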