What to put on the Board
 Job Title
 Salaries
 Mean
 Median
Describing Distributions with
Numbers
CHAPTER 2
Travel times (mins.) for 15 workers in North Carolina
30 20 10 40 25 20 10 10 60 15 40 5 30 12 10
Stemplot (stem = tens, leaf = ones):
0 | 5
1 | 000025
2 | 005
3 | 00
4 | 00
5 |
6 | 0
Shape, Center, Spread
 Shape:
 Skewed right
 Center:
 20
 Spread:
 5 to 60
Measuring Center
MEAN AND MEDIAN
Finding the Mean = Average
 $\bar{x} = \frac{\sum x_i}{n}$
OR
 $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
$\bar{x}$ = sample mean
∑ = “the sum of”
n = # in sample
Example: Find mean of travel time in North Carolina
 $\bar{x} = \frac{30 + 20 + \dots + 10}{15} = \frac{337}{15} \approx 22.5$ minutes
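The arithmetic in this example can be checked with a short Python sketch. The names nc_times and sample_mean are illustrative and not part of the slides; the data are the 15 travel times listed earlier.

```python
# Travel times (minutes) for the 15 North Carolina workers, as listed above.
nc_times = [30, 20, 10, 40, 25, 20, 10, 10, 60, 15, 40, 5, 30, 12, 10]

def sample_mean(values):
    """Return the sum of the observations divided by how many there are."""
    return sum(values) / len(values)

total = sum(nc_times)          # numerator of the formula
mean = sample_mean(nc_times)   # total divided by n
print(f"sum = {total}, n = {len(nc_times)}, mean = {mean:.2f}")
```

The unrounded result is about 22.47, which the slide rounds to 22.5 minutes.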
 The most common measure of center
 Sensitive to the influence of extreme observations
 Outliers pull the mean towards the outlier
 Skewed data pulls the mean toward the longer tail (in the
direction of the skew)
 Mean is not a resistant measure of center
because of this sensitivity
Example (find the mean of both)
 A) 1,2,3,4,5
B) 1,2,3,4,50
Median
Median = midpoint
Symbol: M
 Arrange numbers in increasing order and count to
center
 If n is odd, M is located at position $\frac{n+1}{2}$ on the list
 If n is even, the location of M is also at position $\frac{n+1}{2}$, but you will need to find the mean of the two center numbers on the list, since $\frac{n+1}{2}$ will give you a decimal (.5) value
Example (find the median of both)
1,2,8,9,15
1,2,8,9
 Median is a resistant measure (not influenced by
extreme observations)
1,2,3,4,5
1,2,3,4,50
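A small sketch of what "resistant" means, using the two lists above and Python's standard statistics module. The variable names a and b are just labels for the two lists.

```python
from statistics import mean, median

a = [1, 2, 3, 4, 5]
b = [1, 2, 3, 4, 50]   # same list with the largest value replaced by 50

# The extreme value moves the mean a long way but leaves the median alone.
print("A: mean =", mean(a), " median =", median(a))
print("B: mean =", mean(b), " median =", median(b))
```

Only the mean changes between the two lists, which is the sense in which the median is resistant and the mean is not.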
Find the median for North Carolina drivers
REFER BACK TO EXAMPLE 1
 5 10 10 10 10 12 15 20 20 25 30 30 40 40 60
Location of M = $\frac{n+1}{2}$ = (15 + 1)/2 = 8, so M is the 8th value in the list: M = 20
Note: the formula does not give you the median, just the location
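The difference between the location and the median itself can be sketched in Python. The function name median_by_position is hypothetical; it returns both the 1-based position (n + 1)/2 and the value found there, averaging the two center numbers when the position ends in .5.

```python
def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 location rule."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2              # 1-based location; ends in .5 when n is even
    if n % 2 == 1:
        # Odd n: the position is a whole number, so read that value directly.
        return position, ordered[int(position) - 1]
    # Even n: average the two values on either side of the .5 position.
    lower = ordered[n // 2 - 1]
    upper = ordered[n // 2]
    return position, (lower + upper) / 2

nc_times = [30, 20, 10, 40, 25, 20, 10, 10, 60, 15, 40, 5, 30, 12, 10]
print(median_by_position(nc_times))      # position 8, value at that position
print(median_by_position([1, 2, 8, 9]))  # position 2.5, mean of 2 and 8
```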
Comparing mean and median
 In a symmetric distribution, the mean and median are equal: $\bar{x} = M$
 If the distribution is roughly symmetric, the mean and
median are close together.
 If the distribution is skewed, the mean is farther out in
the long tail than the median (mean is past median in the
direction of the skew).
Cont…
 Mean and median both give a measure of center.
Often one is a better choice than the other depending
on the situation and the data set.
 “average value” usually implies mean
 “typical value” usually implies median
Using the Graphing Calculator
 When given a larger data set, it might be easier to
find the mean and median using your TI-83/84.
 Steps
1. Hit STAT, Edit (make sure lists are all clear)
2. Enter values in L1 (type number, hit ENTER or ▼)
3. Hit STAT, over to CALC. Choose 1: 1-Var Stats. Hit
ENTER.
 Mean is $\bar{x}$ (at top of screen)
 Use down arrow to scroll down and see median (Med=)
Practice Problem: Exit Ticket
 The Major League Baseball single-season home run
record is held by Barry Bonds, who hit 73 in 2001.
Below are Bonds's home run totals from 1986 to 2004:
16 19 24 25 25 33 33 34 34 37 37 40 42 45 45
46 46 49 73
Bonds's record year is a high outlier. How do his
career mean and median number of home runs
change when we drop the 73? What general fact
about mean and median do your results illustrate?
Measuring Spread
THE QUARTILES
Actuaries
Actuaries earned a mean annual wage of $95,420 in
2007, with the top tier hauling down a tasty
$145,600. Most (60 percent) are employed in the
insurance industry, crunching numbers to determine
risks in pension planning, insurance coverage, or
investment strategies. That means they need a high
math aptitude and financial savvy. However, many
only hold a bachelor's degree in math, business or
statistics.
Dental Hygienists
Not all dental hygienists' earnings skyrocket into the
six figures -- but there are enough that do, making
this a surprisingly rich opportunity for someone who
holds only an associate degree. The Labor
Department reports that while the median earnings
are in the high $60k range, the top-end hygienists
found themselves in the $90k range last year. And
you can prep for this career in the two-year, online
career training program and be loving life in a matter
of a few years with experience.
What do you think is the median
household income??
Average household income: 2004
 Census Bureau reported:
 Median- $44,389
 Mean- $60,528
 Bottom 10%- less than $10,927
 Upper 5%- above $157,185
Median household income (2014), 1 earner to 4 people…
Why spread?
 Mean and median are useful for center but can be
misleading or not tell us “the whole story.”

We need to also know about the spread and variability in
the data.
 The simplest useful numerical description of a
distribution requires both a measure of center and a
measure of spread.
Range
Range = difference between the largest and smallest numbers
= max – min (SINGLE NUMBER ANSWER)
 Shows full spread of data, but could involve
outliers so not a great choice
 Improve description by looking at spread of middle
half of data as well:
Quartiles
Use Quartiles
 Put numbers in increasing order and find M (also
called Q2 or 2nd quartile)
 Find the median of the first half (numbers to left of
M) = Q1 (1st quartile)
 Find median of the second half (numbers to right of
M) = Q3 (3rd quartile)
Quick Facts
 25% of data is below Q1
 25% of data is above Q3
 75% of data is below Q3
 75% of data is above Q1
 50% of data is between Q1 & Q3
 50% of data is below (or above) Q2
Examples
 5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60

n=15 **If n is odd then M is not included when counting to
find Q1 & Q3
Examples cont…
 5, 10, 10, 15, 15, 20, 20, 40, 45, 60, 65, 85

n=12 **If n is even, all values are used to count into Q1 & Q3
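The quartile rule from these two examples can be sketched in Python as follows. The function names median_of_sorted and quartiles are illustrative; the rule coded here is the one stated above (leave M out of both halves when n is odd, use all values when n is even), which can differ slightly from the quartile methods built into calculators and software.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, M, Q3) using the halves to the left and right of M."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the median itself; when n is even, use every value.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)

odd_example = [5, 10, 10, 10, 10, 12, 15, 20, 20, 25, 30, 30, 40, 40, 60]   # n = 15
even_example = [5, 10, 10, 15, 15, 20, 20, 40, 45, 60, 65, 85]              # n = 12
print(quartiles(odd_example))
print(quartiles(even_example))
```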
Five Number Summary
 Gives a reasonably complete description of center
and spread
 Consists of minimum, Q1, median, Q3, maximum
 Can be used to make a Box Plot (or Box and
Whisker Plot)
Try this one
Examples: Finding Five Number Summaries
5,10,10,10,10,12,15,20,20,25,30,30,40,40,60
Examples cont…
 5, 10, 10, 15, 15, 20, 20, 40, 45, 60, 65, 85
Constructing Box and Whisker Plots
1. Draw a number line (usually by 5’s or 10’s). Place
dots above the line at each of the five values from your
Five Number Summary.
2. Draw a box around Q1 & Q3.
3. Draw a vertical line through M.
4. Draw “whiskers” out to max and min.
 Can draw more than one plot over the same axis to
do a side-by-side comparison of multiple data sets.
Called a Stacked Box Plot.
 Can discuss shape similar to histograms

Symmetric = Q1 to Med to Q3 evenly spaced

Skewed = Q1 & Q3 not evenly spaced or whiskers uneven
in length
(which also shows possible outliers)
Example
 Examples: Constructing Box Plots (use data
sets above, make stacked plot)
Assignment
 Construct a box plot for each offensive position in
the following table and have them stacked.
 10 point homework assignment
Page 59 in textbook (just do offense)
Spotting Suspected Outliers
Warm Up
 Give the five-number summary of the following 19 #s
12, 14, 15, 15, 15, 15, 18, 22, 30, 33, 33, 34, 35, 39, 40, 41, 72, 78, 91

Min-
Q1-
M-
Q3-
Max-
Finding Outliers
 Interquartile Range = the spread of the quartiles
IQR = Q3 – Q1
 Use this value when finding the boundaries for
outliers:
Upper Bound = Q3 + (IQR x 1.5)
Lower Bound = Q1 – (IQR x 1.5)
 Any data values beyond the boundaries on either end
of your list are outliers.
Examples
NC: 30,20,10,40,25,20,10,60,15,40,5,30,12,10,10
Are there any outliers??
NY: 10,30,5,25,40,20,10,15,30,20,15,20,85,15,65,15,60,60,40,45
On a Box Plot, outliers should be marked with a star (*), then
end the whisker on that side at the highest non-outlier
value.
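The 1.5 × IQR check can be sketched in Python for the NC and NY data above. The helper names are hypothetical, and the quartiles are computed with the halves-of-the-list rule used in this chapter, so other software may report slightly different bounds.

```python
def median_of_sorted(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def outlier_bounds(values):
    """Return (lower bound, upper bound) from Q1, Q3 and the IQR."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    q1 = median_of_sorted(ordered[:half])
    q3 = median_of_sorted(ordered[half + 1:] if n % 2 == 1 else ordered[half:])
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(values):
    """List every value that falls beyond either bound."""
    low, high = outlier_bounds(values)
    return [v for v in values if v < low or v > high]

nc = [30, 20, 10, 40, 25, 20, 10, 60, 15, 40, 5, 30, 12, 10, 10]
ny = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20, 15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
print("NC bounds:", outlier_bounds(nc), "outliers:", find_outliers(nc))
print("NY bounds:", outlier_bounds(ny), "outliers:", find_outliers(ny))
```

A value sitting exactly on a bound is not flagged, since the rule asks for values beyond the boundaries.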
Why Spot Outliers?
The town of Manhattan, Kansas is sometimes called
the “Little Apple” to distinguish it from the other
Manhattan. A few years ago, a house there appeared
in the county appraiser’s records valued at
$200,059,000.00. That would be quite a house,
even on Manhattan Island. As you might guess, the
entry was wrong: the true value was $59,000.00.
But before the error was discovered, the county, city,
and the school board had based their budgets on the
total appraised value of real estate, which the one
outlier jacked up by 6.5%. It can pay to check for
outliers!
Measuring Spread:
The Standard Deviation
 Standard deviation = measures spread by looking
at how far the observations are from the mean.
Formula: $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$
 Variance = $s^2$, the square of the standard deviation
Example
Use the following data set to complete the steps below:
41 38 39 45 47 41
Steps
1. To find these measures, the first step is always to
find the mean of the data set.
2. Make a chart to complete the rest of the calculations
(see below).
3. Subtract the mean from each number in the data
set. Make sure to include the positive or negative
sign.
This is called finding the deviation. It shows us
how much each value varies from the mean of
the set. For every data set, the sum
of this column will always be zero, so we need to
take other steps
4. Square each value from step 3 in the next column.
Add this column to get a total.
5. Divide this total by one less than the number of
entries in the data set n–1.
This is called the sample variance. Because it
involves a total of squared
values, we need to take one last step.
6. Take the square root of the answer from step 5.
This is called the sample standard deviation.
This is the final answer of the problem. It is used to:
1. show how much in general a data set varies from
its average
2. show consistency when comparing data sets
(lower SD = more consistent values = closer as a
whole to the mean)
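The six steps can be sketched in Python for the data set 41, 38, 39, 45, 47, 41. The variable names are illustrative; each line is labeled with the step it corresponds to.

```python
from math import sqrt

data = [41, 38, 39, 45, 47, 41]
n = len(data)

mean = sum(data) / n                        # Step 1: mean of the data set
deviations = [x - mean for x in data]       # Step 3: value minus mean, sign included
squares = [d ** 2 for d in deviations]      # Step 4: square each deviation
total_of_squares = sum(squares)             # Step 4: add the squares column
variance = total_of_squares / (n - 1)       # Step 5: divide by n - 1 (sample variance)
std_dev = sqrt(variance)                    # Step 6: square root (sample standard deviation)

print(f"mean = {mean:.3f}")
print(f"sum of deviations = {sum(deviations):.3f}")   # zero, up to rounding error
print(f"total of squares = {total_of_squares:.3f}")
print(f"variance = {variance:.3f}")
print(f"standard deviation = {std_dev:.3f}")
```

The printed sum of deviations shows why the squaring step is needed: the positive and negative deviations cancel out.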
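Example:
Mean = ________
Data Values    Value − Mean    Squares
41             ________        ________
38             ________        ________
39             ________        ________
45             ________        ________
47             ________        ________
41             ________        ________
Total of Squares = ___________
Squares / (n-1) = ________ (Variance)
Square root = ______ = standard deviation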
Properties
 s measures spread about the mean and should only
be used when mean is chosen as the measure of
center
 s ≥ 0 always. s = 0 only when there is no spread (when all
observations have the same value). More spread out
= greater s
 s has the same units as original data values
 s is not resistant to outliers and skew (like the mean)
Cont…
Because $\bar{x}$ and s are sensitive to extreme
observations, they can be misleading when data is
strongly skewed or has outliers. Because of this:
 If the distribution is skewed or has outliers, describe
the data set with the Five-Number Summary
 If the distribution is reasonably symmetric and free
of outliers, describe with $\bar{x}$ and s
Choosing Measure of Center and Spread
Five-Number Summary
 Better for describing a skewed distribution or
a distribution with strong outliers
Mean and Standard Deviation
 Best for reasonably symmetric distributions that are
free of outliers
Reminder!
 Remember that a graph gives the best overall picture
of a distribution. Numerical measures of center and
spread report specific facts about a distribution, but
they do not describe its entire shape.
 Always plot your data!
Find the standard deviation (hand in 3pts)
 10, 8, 12, 14, 16, 8