Center and Spread - highlandstatistics

For Reals This Time

Center
Three major measures of central tendency are the mean, median, and mode. In a list of numbers, the mode is the most common value. If there is a tie, each most common value is a mode. If every value appears only once, they are all the mode. This is where the labels unimodal, bimodal, and multimodal come from.

Median
The median is the point in an ordered list that has just as much data above it as below it. The median is not always part of the data set. The median is not affected by outliers. It does not matter how large or small the data above or below it is, just how much data is above or below the median.

Mean
When you think of an average, you probably think of the mean. The process is as simple as adding up all of the values and dividing by how many of them you added up. There are regular means, weighted means, and trimmed means.

Regular Means
This is what we will use pretty much the whole class long. This is where you add up all the data and divide by the number of data points in the set. The regular mean is affected by outliers. Consider what Bill Gates’ income must do to the mean income in Bellevue, WA.

Weighted Means
This kind of mean comes up when each value has a weight attached to it. In other words, some numbers count more than others. It is used for getting means from frequency tables, finding center of mass, and calculating college GPAs. This kind of mean shows up on the AP Test fairly often.

Consider this example: Fred has taken 5 classes with 3 credits each (3 A’s, 2 B’s), a 4 credit class (a B), a 5 credit class (an A), and three 1 credit classes (2 A’s and a B). In GPA, an A is worth 4, a B is worth 3, and a C is worth 2. But we have to add up these values times the number of credits they were worth, and then divide by the total number of credits. This is because not every class had the same number of credits.
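As a quick check on Fred's GPA, here is a minimal Python sketch of a weighted mean. The function name `weighted_mean` and the `(credits, grade points)` pair layout are illustrative choices, not something from the slides. The grades and credits are taken from the example above.

```python
# Fred's transcript as (credits, grade points) pairs, from the example above.
# An A is worth 4 points and a B is worth 3 points.
A, B = 4, 3
classes = (
    [(3, A)] * 3 + [(3, B)] * 2   # five 3-credit classes: 3 A's, 2 B's
    + [(4, B)]                    # one 4-credit class: a B
    + [(5, A)]                    # one 5-credit class: an A
    + [(1, A)] * 2 + [(1, B)]     # three 1-credit classes: 2 A's, 1 B
)


def weighted_mean(pairs):
    """Sum of (weight * value) divided by the sum of the weights."""
    total_weight = sum(w for w, _ in pairs)
    weighted_sum = sum(w * v for w, v in pairs)
    return weighted_sum / total_weight


gpa = weighted_mean(classes)
print(round(gpa, 2))  # 97 / 27, which rounds to 3.59
```

The credits act as the weights, so a 5 credit A pulls the GPA up five times as hard as a 1 credit A.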
Weighted Means
So here is what Fred's GPA calculation looks like at first:

\[ \frac{3\cdot4 + 3\cdot4 + 3\cdot4 + 3\cdot3 + 3\cdot3 + 4\cdot3 + 5\cdot4 + 1\cdot4 + 1\cdot4 + 1\cdot3}{3 + 3 + 3 + 3 + 3 + 4 + 5 + 1 + 1 + 1} \]

Once it has been simplified a bit, it is not so bad:

\[ \frac{12 + 12 + 12 + 9 + 9 + 12 + 20 + 4 + 4 + 3}{27} \]

And our end result:

\[ \frac{97}{27} \approx 3.59 \]

Trimmed Means
Sometimes we will want to find the mean for a data set that has some really extreme outliers. If we just ignore the outliers, this makes our mean less accurate. For a trimmed mean we ignore one or more of the highest values and the same number of the lowest values. So we might ignore the highest and lowest value. Or we might ignore the highest five and lowest five values.

Midrange
Since this term could appear on an AP Test, it is good to know about. It is the average of the maximum and minimum values of our data set. It is generally rather inaccurate as a measure of central tendency. It is best to use when it is the only option. Like when? Zombie Apocalypse!

Spread
There are three major measures of spread: range, interquartile range, and standard deviation. The range is just the maximum value minus the minimum value. The way range is commonly used, as in “the data goes from <value> to <value>,” is not the way we are using range here. The range will be just a single number.

Interquartile Range
Quartiles are points that divide ordered data into quarters. The first quartile, or Q1, separates the lower 25% of the data from the upper 75%. The second quartile separates the lower 50% of the data from the upper 50%. This is the median. The third quartile, or Q3, separates the lower 75% of the data from the upper 25%.

The interquartile range, or IQR, is Q3 minus Q1. In other words, it is a range for the quartiles instead of the whole data set. This measure is very useful in finding outliers.

The Five Number Summary
A data set can be summarized with a specific set of five numbers.
Your calculator will do this. I mention this because it is the method I recommend. The five numbers are the start, 25% mark, 50% mark, 75% mark, and stop of the data. Officially, the Min, Q1, Med, Q3, and Max.

Another Chart!
A boxplot, also known as a box-and-whiskers plot, is based on the five number summary. The box shows Q1, Med, and Q3. The whiskers go out to the Min and Max. In a modified boxplot, the whiskers go out to the lowest and highest values that are not outliers. The outliers get separate dots.

The IQR x 1.5 Criterion
To find outliers, we can use the IQR x 1.5 Criterion. First, we need the IQR. Then, we multiply the IQR by 1.5. This is the amount that a value could reasonably be beyond the quartiles. Q1 minus this number is the lower threshold. Q3 plus this number is the upper threshold. Any value below the lower threshold or above the upper threshold is an outlier. I will now demonstrate, although your book also gives examples.

Standard Deviation
Sometimes a specific case or data point will be different than the average. Sometimes every case or data point will be different than the average. The amount by which a data point differs from the average is called a deviation. Numbers below the average have a negative deviation.

Since the average shows the center of all of the data, if we total all of the deviations, we get zero. This is because the numbers that are above average are balanced by numbers below average. In order to keep this from making our deviations useless information, we will square them before we add them.

So we take all of these squared deviations and we add them up. In order to find the average squared deviation, we should divide by how many there are. We will instead divide by one less than how many there are. ‘Cuz Sanford says so.

So we find each deviation, square it, add them up, and divide by one less than the number of data points, just to recap.
Then, since these are all squared deviations, we take the square root of the result in order to put it back into the right scale. This tells us how far a typical value is from our mean.

The formula for this is:

\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]

Which means take each value minus the mean, square each difference, add up all the squared differences, divide that total by one less than the number of data points, and then square root the result.

Chapter 5’s Exciting Conclusion

Mean, Median, Mode, Midrange
Do not use the midrange for anything serious until a zombie apocalypse creates situations where you never have much data but have to do calculations in seconds or die. The mode is useful on a histogram as the bump, or the number of bumps, but is generally a poor measure for a pile of data. It might sometimes have use in identifying what a “typical” score was if there is an obvious mode.

The median gives a reliable measure of center no matter what. The mean is a more powerful value as long as two conditions are met: the data must be symmetric, and the data must not have outliers. In later units, when the data looks close enough to the “usual lump,” we will still use the mean even if there is slight skew or outliers.

Range, IQR, and Standard Deviation
The range is useful in situations where you need to make calculations quickly and do not have much data (i.e., a zombie apocalypse). The IQR always works, like the median does, and should be used whenever you are using the median. It is also used for the IQR x 1.5 criterion for outliers, even if you use a different measure of spread.

The standard deviation is good for use when you would use the mean: symmetric distributions with no outliers. The standard deviation is more powerful, and is the preference when you can use it.

To summarize: Mean and standard deviation go together when there are no outliers or skew.
Median and IQR go together the rest of the time. In case of zombie apocalypse, rely on range and midrange.

Comparing Distributions
When we compare two distributions, the common choices are as follows:
Paired bar chart for categorical data.
For Display Only: Stemplot for two small quantitative distributions (30 or fewer data points each) when no convenient technology is available, or two histograms when it is.
For Further Analysis: Boxplots for quantitative data large enough to have a five number summary. Not limited to only two charts.

Comparing Many Distributions
When comparing more than two distributions of quantitative data, the boxplot is the way to go. Your calculator will create awesome boxplots. When comparing boxplots, we want to compare the shape, center, and spread. We can usually do this by comparing the low whiskers, the high whiskers, and the position of the box.

The Worksheet of Doom
I am handing you the worksheet of doom. It is not for a grade. I will be going over it in class tomorrow. If you really want to do well on this upcoming test in 2 weeks, you will attempt the worksheet tonight as best as you can and fill in the gaps tomorrow. In case you are not in class tomorrow, I have posted a walkthrough of this, which basically fills you in on the same things, just less usefully than actually attending.

Assignments
Read chapter 5. Study for your chapter 5 quiz on Friday. Let Mr. Sanford explain what comes next. Chapter 5: 5, 9, 12, 13, 14, 17, 18, 19, 21, 25, 29, 33, 34, 37, 41, 45. Read all of them, and then do eight of them: at least 3 from the first half (through 19) and at least 3 from the second half (from 21 on). The worksheet I am handing out is not for a grade, but we will go over it in lecture.

Chapter 5 Quiz Bulletpoints
Be able to sketch a histogram and a (modified) boxplot of the same data set.
Be able to find the median, mean, IQR, and standard deviation, as well as to discuss which ones should be used.
Be able to use a five number summary to discuss whether the mean is above or below the median.
Know how a change in the numbers affects the mean and standard deviation.
Be able to compare two boxplots.
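
For practice away from the calculator, the five number summary, the IQR x 1.5 criterion, and the standard deviation recipe from the earlier slides can be sketched in Python. This is a sketch under stated assumptions: the function names and the `scores` data are made up for illustration. The quartile rule used here (medians of the lower and upper halves, leaving out the middle value when the count is odd) is one common convention, and other tools may give slightly different quartiles.

```python
import math
from statistics import mean, median


def five_number_summary(data):
    """Return (Min, Q1, Med, Q3, Max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    with the middle value left out of both halves when the count is odd.
    """
    xs = sorted(data)
    n = len(xs)
    half = n // 2
    lower = xs[:half]
    upper = xs[half + 1:] if n % 2 else xs[half:]
    return xs[0], median(lower), median(xs), median(upper), xs[-1]


def outliers_by_iqr(data):
    """Values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    _, q1, _, q3, _ = five_number_summary(data)
    step = 1.5 * (q3 - q1)
    low, high = q1 - step, q3 + step
    return [x for x in data if x < low or x > high]


def sample_std(data):
    """Square each deviation, add, divide by n - 1, then square root."""
    xbar = mean(data)
    squared_deviations = [(x - xbar) ** 2 for x in data]
    return math.sqrt(sum(squared_deviations) / (len(data) - 1))


scores = [2, 4, 4, 5, 6, 7, 8, 9, 30]  # made-up data with one extreme value

print(five_number_summary(scores))  # (2, 4.0, 6, 8.5, 30)
print(outliers_by_iqr(scores))      # [30]
print(sample_std(scores))
```

Here the IQR is 8.5 - 4 = 4.5, so the thresholds are 4 - 6.75 = -2.75 and 8.5 + 6.75 = 15.25, and only 30 falls outside them. Removing the 30 and rerunning shows how much one outlier moves the mean and standard deviation while the median barely changes.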
