Center and Spread - highlandstatistics

For Reals This Time
Center
 Three major measures of central tendency are the
mean, median, and mode.
 In a list of numbers, the mode is the most common
value.
 If there is a tie, each most common value is a mode.
 If every value appears only once, they are all the mode.
 This is where the label for unimodal, bimodal and
multimodal comes from.
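As a quick illustration of the rules above, here is a minimal sketch using `multimode` from Python's standard `statistics` module (Python 3.8 or later). The three lists are made-up data, not from the class.

```python
from statistics import multimode

# Hypothetical data sets, made up for illustration only.
one_mode = [2, 3, 3, 5, 7]        # 3 appears most often
two_modes = [1, 1, 2, 4, 4, 6]    # 1 and 4 tie for most common
all_unique = [1, 2, 3, 4]         # every value appears exactly once

print(multimode(one_mode))    # [3]
print(multimode(two_modes))   # [1, 4]
print(multimode(all_unique))  # [1, 2, 3, 4]
```

Note that `multimode` follows the same convention as the slide: when every value appears once, it reports all of them.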
Median
 The median is the point in an ordered list which has
just as much data above it as below it.
 The median is not always part of the data set. With an even number of
values, it is the average of the two middle values.
 The median is not affected by outliers. It does not
matter how large or small the values above or below
it are…just how many values are above or below the median.
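A small sketch of both points, using `median` from Python's standard `statistics` module. The lists are hypothetical values chosen only to show the behavior.

```python
from statistics import median

# Hypothetical data, made up for illustration only.
odd_count = [3, 5, 8, 13, 21]         # the middle value is in the list
even_count = [3, 5, 8, 13]            # the middle falls between 5 and 8
with_outlier = [3, 5, 8, 13, 21000]   # same as odd_count, but the top value is extreme

print(median(odd_count))     # 8
print(median(even_count))    # 6.5, which is not a value in the list
print(median(with_outlier))  # 8, unchanged by the extreme value
```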
Mean
 When you think of an average, you probably think of
the mean.
 The process is as simple as adding up all of the values
and dividing by how many of them you added up.
 There are regular means, weighted means, and
trimmed means.
Regular Means
 This is what we will use pretty much the whole class
long.
 This is where you add up all the data and divide by the
number of data points in the set.
 Affected by outliers.
 Consider what Bill Gates’ income must do to the mean
income in Bellevue, WA.
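To see the effect, here is a minimal sketch with invented incomes. These numbers are assumptions for illustration, not real Bellevue data.

```python
from statistics import mean, median

# Invented incomes in dollars; not real data.
incomes = [40_000, 50_000, 60_000, 70_000, 80_000]
with_outlier = incomes + [10_000_000]   # add one extremely large income

print(mean(incomes))                 # 60000
print(round(mean(with_outlier)))     # 1716667
print(median(incomes))               # 60000
print(median(with_outlier))          # 65000.0
```

One extreme value drags the mean far above every other income in the list, while the median barely moves.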
Weighted Means
 This kind of mean comes up when each value has a
weight with it. In other words, some numbers count
more than others.
 Used for getting means from frequency tables, finding
center of mass, and calculating college GPAs.
 This kind of mean shows up on the AP Test fairly
often.
Weighted Means
 Consider this example:
 Fred has taken five 3-credit classes (3 A’s, 2 B’s), a
4-credit class (a B), a 5-credit class (an A), and three 1-credit
classes (2 A’s and a B).
 In GPA, an A is worth 4, a B is worth 3, and a C is
worth 2.
 But we have to add up these values times the
number of credits they were worth, and then
divide by the total number of credits.
 This is because not every class had the same
number of credits.
Weighted Means
So here is what it looks like at first:
\[
\frac{3\cdot4 + 3\cdot4 + 3\cdot4 + 3\cdot3 + 3\cdot3 + 4\cdot3 + 5\cdot4 + 1\cdot4 + 1\cdot4 + 1\cdot3}{3+3+3+3+3+4+5+1+1+1}
\]
Once it has been simplified a bit, it is not so bad:
\[
\frac{12+12+12+9+9+12+20+4+4+3}{27}
\]
And our end result:
\[
\frac{97}{27} \approx 3.59
\]
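Fred's GPA can be checked with a few lines of Python. This is only a sketch of the weighted mean; the variable names are made up here, and the (credits, grade points) pairs come straight from the example.

```python
# Each pair is (credits, grade points) for one of Fred's classes: A = 4, B = 3.
classes = [
    (3, 4), (3, 4), (3, 4),   # three 3-credit A's
    (3, 3), (3, 3),           # two 3-credit B's
    (4, 3),                   # one 4-credit B
    (5, 4),                   # one 5-credit A
    (1, 4), (1, 4),           # two 1-credit A's
    (1, 3),                   # one 1-credit B
]

# Weighted mean: sum of (weight * value) divided by the sum of the weights.
total_points = sum(credits * grade for credits, grade in classes)
total_credits = sum(credits for credits, _ in classes)
gpa = total_points / total_credits

print(total_points, total_credits, round(gpa, 2))  # 97 27 3.59
```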
Trimmed Means
 Sometimes we will want to find the mean for a data
set that has some really extreme outliers.
 If we just ignore the outliers, this makes our mean
less accurate.
 For a trimmed mean we ignore one or more of the
highest values and the same number of the lowest
values.
 So we might ignore the highest and lowest value.
 Or we might ignore the highest five and lowest five
values.
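A minimal sketch of a trimmed mean follows. The function name `trimmed_mean`, the parameter `k`, and the scores are all hypothetical, chosen for illustration.

```python
def trimmed_mean(values, k):
    """Mean after dropping the k lowest and the k highest values."""
    if k < 0 or 2 * k >= len(values):
        raise ValueError("k must leave at least one value to average")
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]   # drop k values from each end
    return sum(kept) / len(kept)

# Hypothetical scores with one extreme value on each end.
scores = [2, 71, 74, 78, 80, 83, 85, 250]

print(trimmed_mean(scores, 0))  # 90.375 (regular mean, nothing trimmed)
print(trimmed_mean(scores, 1))  # 78.5   (ignores the 2 and the 250)
```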
Midrange
 Since this term could appear on an AP Test, it is good
to know about.
 It is the average of the maximum and minimum values
of our data set.
 It is generally rather inaccurate as a measure of central
tendency.
 It is best to use when it is the only option.
 Like when?
 Zombie Apocalypse!
Spread
 There are three major measures of spread:
range, interquartile range, and standard deviation.
 The range is the maximum value minus the
minimum value.
 The way range is used commonly, as in “the data goes
from <value> to <value>” is not the way we are using
range here.
 The range will be just a single number.
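As a small sketch on made-up data, the range comes out as one number. The midrange from the earlier slide is computed alongside it for comparison, since both use only the minimum and maximum.

```python
# Hypothetical data, made up for illustration only.
data = [4, 9, 15, 7, 22, 10]

data_range = max(data) - min(data)       # a single number, not "4 to 22"
midrange = (max(data) + min(data)) / 2   # average of the two extremes

print(data_range)  # 18
print(midrange)    # 13.0
```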
Interquartile Range
 Quartiles are points that divide ordered data into
quarters.
 The first quartile, or Q1, separates the lower 25% of
the data from the upper 75%.
 The second quartile separates the lower 50% of the
data from the higher 50%. This is the median.
 The third quartile, or Q3, separates the lower 75%
of the data from the upper 25%.
Interquartile Range
 The interquartile range, or IQR, is Q3 minus Q1.
 In other words it is a range for the quartiles instead of
the whole data set.
 This measure is very useful in finding outliers.
The Five Number Summary
 A data set can be summarized with a specific set of five
numbers.
 Your calculator will do this.
 I mention this because it is the method I recommend.
 The five numbers are the start, 25% mark, 50% mark,
75% mark, and stop of the data.
 Officially, the Min, Q1, Med, Q3, and Max
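Here is a rough sketch of a five number summary in Python. It assumes one common quartile convention: Q1 and Q3 are the medians of the lower and upper halves, leaving out the overall median when the count is odd. Other software and textbooks may use slightly different rules, so small differences are possible. The function name and data are hypothetical.

```python
from statistics import median

def five_number_summary(values):
    """Return (Min, Q1, Med, Q3, Max) using the median-of-halves convention."""
    if len(values) < 2:
        raise ValueError("need at least two values")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd count, skip the middle value so it is in neither half.
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

# Hypothetical data, made up for illustration only.
data = [1, 3, 4, 6, 7, 8, 10, 12, 15]
print(five_number_summary(data))  # (1, 3.5, 7, 11.0, 15)
```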
Another Chart!
 A boxplot, also known as a box-and-whiskers plot, is
based on the five number summary.
 The box shows Q1, Med, and Q3.
 The whiskers go out to the Min and Max.
 A modified boxplot goes out to the lowest and highest
values that are not outliers.
 The outliers get separate dots.
The IQR x 1.5 Criterion
 To find outliers, we can use the IQR x 1.5
Criterion.
 First, we need the IQR.
 Then, we multiply the IQR by 1.5.
 This is the amount that a value could reasonably
be beyond the quartiles.
 Q1 minus this number is the lower threshold.
 Q3 plus this number is the upper threshold.
 I will now demonstrate, although your book also gives
examples.
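The steps above can be sketched in Python as follows. The function name, the data, and the quartile values are assumptions for illustration; the quartiles were found by taking the medians of the lower and upper halves of the data.

```python
def outlier_thresholds(q1, q3):
    """Return the (lower, upper) thresholds from the IQR x 1.5 criterion."""
    iqr = q3 - q1          # step 1: find the IQR
    step = 1.5 * iqr       # step 2: multiply it by 1.5
    return q1 - step, q3 + step

# Hypothetical data with one suspiciously large value.
data = [1, 3, 4, 6, 7, 8, 10, 12, 40]
q1, q3 = 3.5, 11.0         # assumed quartiles for this data set

low, high = outlier_thresholds(q1, q3)
outliers = [x for x in data if x < low or x > high]

print(low, high)   # -7.75 22.25
print(outliers)    # [40]
```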
Standard Deviation
 Sometimes a specific case or data point will be
different than the average.
 Sometimes every case or data point will be different than
the average.
 The amount by which a data point differs from the average
is called a deviation.
 Numbers below the average have a negative
deviation.
Standard Deviation
 Since the average shows the center of all of the data, if
we total all of the deviations, we get zero.
 This is because the numbers that are above average are
balanced by numbers below average.
 In order to keep this from making our deviations
useless information, we will square them before we
add them.
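A short sketch on made-up data shows the problem and the fix: the raw deviations cancel out, but the squared deviations do not.

```python
# Hypothetical data, made up for illustration only; the mean is 5.
data = [2, 4, 4, 5, 10]
mean = sum(data) / len(data)

deviations = [x - mean for x in data]       # negative below the mean, positive above
squared = [d ** 2 for d in deviations]      # squaring removes the signs

print(deviations)       # [-3.0, -1.0, -1.0, 0.0, 5.0]
print(sum(deviations))  # 0.0
print(sum(squared))     # 36.0
```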
Standard Deviation
 So we take all of these squared deviations and we add
them up.
 In order to find the average squared deviation, we
should divide by how many there are.
 We will instead divide by one less than how many
there are.
 ‘Cuz Sanford says so.
Standard Deviation
 So we find each deviation, square it, add them up, and
divide by one less than the number of data points, just
to recap.
 Then, since these are all squared deviations, we square
root them in order to put them back into the right
scale.
 This tells us how much a typical value is away from our
mean.
Standard Deviation
 The formula for this is:
\[
s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}
\]
 Which means take each value minus the mean,
square each difference, add up all the squared
differences, divide that total by one less than the
number of data points and then square root the
result.
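The recipe can be followed step by step in Python. This is only a sketch on made-up data; the function name `sample_std_dev` is hypothetical, and the last line compares the result with `stdev` from the standard `statistics` module, which also divides by one less than the number of data points.

```python
from math import sqrt
from statistics import stdev

def sample_std_dev(values):
    """Standard deviation, dividing by n - 1 as in the formula above."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    squared = [(x - mean) ** 2 for x in values]   # each value minus the mean, squared
    return sqrt(sum(squared) / (n - 1))           # add up, divide by n - 1, square root

# Hypothetical data, made up for illustration only.
data = [2, 4, 4, 5, 10]
s = sample_std_dev(data)

print(s)             # 3.0
print(stdev(data))   # should agree with s
```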
Chapter 5’s Exciting Conclusion
Mean, Median, Mode, Midrange
 Do not use the midrange for anything serious until a
zombie apocalypse creates situations where you never
have much data but have to do calculations in seconds
or die.
 The mode is useful on a histogram as the bump, or the
number of bumps, but is generally a poor measure for
a pile of data.
 It might sometimes have use in identifying what a
“typical” score was if there is an obvious mode.
Mean, Median, Mode, Midrange
 The median gives a reliable measure of center no
matter what.
 The mean is a more powerful value as long as two
conditions are met.
 The data must be symmetric
 The data must not have outliers.
 In later units, when the data looks close enough to the
“usual lump” we will still use the mean even if there is
slight skew or outliers.
Range, IQR, and Standard Deviation
 The range is useful in situations where you need to
make calculations quickly and do not have much data.
 i.e. Zombie Apocalypse
 The IQR always works, like the median does, and
should be used whenever you are using the median.
 It is used for the IQR x 1.5 criterion for outliers as well
even if you use a different measure of spread.
Range, IQR, and Standard Deviation
 The standard deviation is good for use when you
would use the mean.
 Symmetric distributions with no outliers.
 The standard deviation is more powerful, and is the
preference when you can use it.
 To summarize:
 Mean and standard deviation go together when there are
no outliers or skew.
 Median and IQR go together the rest of the time.
 In case of zombie apocalypse, rely on range and
midrange.
Comparing Distributions
 When we compare two distributions, the common
choices are as follows:
 Paired bar chart for categorical data.
 For Display Only: A stemplot for two small quantitative
distributions (30 or fewer data points each) when no
convenient technology is available, or two histograms
when it is.
 For Further Analysis: Boxplots for quantitative data
large enough to have a 5 number summary. Not limited
to only two charts.
Comparing Many Distributions
 When comparing more than two distributions of
quantitative data, the boxplot is the way to go.
 Your calculator will create awesome boxplots.
 When comparing boxplots, we want to compare the
shape, center, and spread.
 We can usually do this by comparing the low whiskers,
the high whiskers, and the position of the box.
The Worksheet of Doom
 I am handing you the worksheet of doom.
 It is not for a grade.
 I will be going over it in class tomorrow.
 If you really want to do well on this upcoming test in 2
weeks, you will attempt the worksheet tonight as best
as you can and fill in the gaps tomorrow.
 In case you are not in class tomorrow, I have posted a
walkthrough of this, which basically fills you in on the
same things…just less usefully than actually attending.
Assignments
 Read chapter 5.
 Study for your chapter 5 quiz on Friday.
 Let Mr. Sanford explain what comes next.
 Chapter 5: 5, 9, 12, 13, 14, 17, 18, 19, 21, 25, 29, 33, 34, 37, 41,
45.
 Read all of them, and then do eight of them. At least 3
from the first half (through 19) and at least 3 from the
second half (from 21 on).
 The worksheet I am handing out is not for a grade, but we
will lecture over it.
Chapter 5 Quiz Bullet Points
 Be able to sketch a histogram and a (modified)
boxplot of the same data set.
 Be able to find the median, mean, IQR, and
standard deviation, as well as to discuss which
ones should be used.
 Be able to use a five number summary to discuss
whether the mean is above or below the median.
 Know how a change in the numbers affects the
mean and standard deviation.
 Be able to compare two boxplots.
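For the bullet point on how a change in the numbers affects the mean and standard deviation, here is a minimal sketch on made-up data. It relies on two standard facts: adding the same constant to every value shifts the mean by that constant and leaves the standard deviation alone, while multiplying every value by a positive constant multiplies both.

```python
from statistics import mean, stdev

# Hypothetical data, made up for illustration only.
data = [2, 4, 4, 5, 10]
shifted = [x + 10 for x in data]   # add 10 to every value
scaled = [x * 2 for x in data]     # double every value

print(mean(data), stdev(data))         # mean 5,  standard deviation 3
print(mean(shifted), stdev(shifted))   # mean 15, standard deviation still 3
print(mean(scaled), stdev(scaled))     # mean 10, standard deviation 6
```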