Numerical descriptors
BPS chapter 2
© 2006 W.H. Freeman and Company
Objectives (BPS chapter 2)
Describing distributions with numbers

Measures of center: mean and median

Measures of spread: quartiles and standard deviation

The five-number summary and boxplots

IQR and outliers

Choosing among summary statistics

Using technology

Organizing a statistical problem
Measure of center: the mean
The mean or arithmetic average
The data to the right are heights (in
inches) of 25 women. How would
you calculate the average, or mean,
height of these 25 women?
Sum of heights is 1598.3
Divided by 25 women = 63.9 inches
Heights (inches):
58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9, 63.9, 63.1, 63.9,
64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2, 66.7, 67.1, 67.8, 68.9, 69.6
The mean (page 38)
woman (i)   height (x)      woman (i)   height (x)
i = 1       x1 = 58.2       i = 14      x14 = 64.0
i = 2       x2 = 59.5       i = 15      x15 = 64.5
i = 3       x3 = 60.7       i = 16      x16 = 64.1
i = 4       x4 = 60.9       i = 17      x17 = 64.8
i = 5       x5 = 61.9       i = 18      x18 = 65.2
i = 6       x6 = 61.9       i = 19      x19 = 65.7
i = 7       x7 = 62.2       i = 20      x20 = 66.2
i = 8       x8 = 62.2       i = 21      x21 = 66.7
i = 9       x9 = 62.4       i = 22      x22 = 67.1
i = 10      x10 = 62.9      i = 23      x23 = 67.8
i = 11      x11 = 63.9      i = 24      x24 = 68.9
i = 12      x12 = 63.1      i = 25      x25 = 69.6
i = 13      x13 = 63.9      n = 25      Σ = 1598.3
Mathematical notation:
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
$$\bar{x} = \frac{1598.3}{25} \approx 63.9$$
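The calculation above can be checked with a few lines of Python. This is a minimal sketch, not part of the original slides; the names `heights` and `mean` are illustrative choices.

```python
# Heights (in inches) of the 25 women, copied from the table above.
heights = [
    58.2, 59.5, 60.7, 60.9, 61.9, 61.9, 62.2, 62.2, 62.4, 62.9,
    63.9, 63.1, 63.9, 64.0, 64.5, 64.1, 64.8, 65.2, 65.7, 66.2,
    66.7, 67.1, 67.8, 68.9, 69.6,
]


def mean(values):
    """Arithmetic mean: add up the observations, divide by how many there are."""
    return sum(values) / len(values)


total = sum(heights)
x_bar = mean(heights)
print(f"n = {len(heights)}, sum = {total:.1f}, mean = {x_bar:.1f}")
```

Rounded to one decimal place, the sum and the mean agree with the 1598.3 and 63.9 shown on the slide.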
Let’s try an example with fewer numbers: Dr. L’s test score data.
Measure of center: the mean
The mean or arithmetic average
Consider the following sample of test scores
from one of Dr. L.’s recent classes (max score =
100):
65, 65, 70, 75, 78, 80, 83, 87, 91, 94
What is the mean score?
78.8
The Mean as a Center of Mass


What happens when we average two numbers? What does the
mean tell us?
Let’s draw both a dot plot and a stem and leaf plot of the test
score data and look at where the mean falls…
65, 65, 70, 75, 78, 80, 83, 87, 91, 94
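One way to make the center-of-mass picture precise, offered as a supplementary note rather than something stated on the slide: the deviations of the observations from the mean always sum to zero, so the mean is the balance point of the dot plot.

```latex
\sum_{i=1}^{n} (x_i - \bar{x})
  = \sum_{i=1}^{n} x_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

For two numbers $a$ and $b$, this says the mean $(a+b)/2$ sits exactly halfway between them.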
Your numerical summary must be meaningful
Height of 25 women in a class: $\bar{x} = 63.9$
The distribution of women’s height appears coherent and symmetric. The mean is a good numerical summary.
Height of plants: $\bar{x} = 69.6$
Here the shape of the distribution is wildly irregular. Why? Could we have more than one plant species or phenotype?
[Figure: Height of plants by color. Number of plants (0 to 5) versus height in centimeters (58 to 84), with separate means marked for the red, pink, and blue groups: 63.9, 70.5, 78.3.]
Sometimes a single numerical summary does not give the whole picture.
Measure of center: the median
The median is the midpoint of a distribution—the number such
that half of the observations are smaller and half are larger.
1. Sort observations from smallest to largest.
n = number of observations
2. If n is odd, the median is observation (n+1)/2 down the list.
n = 25
(n+1)/2 = 26/2 = 13
Median = 3.4
Sorted data (n = 25):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
3. If n is even, the median is the mean of the two center observations.
n = 24
n/2 = 12
Median = (3.3+3.4)/2 = 3.35
Sorted data (n = 24):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6
Measure of center: the median
Back to our test score example:
Consider the following sample of test scores
from one of Dr. L.’s recent classes (max score =
100):
65, 65, 70, 75, 78, 80, 83, 87, 91, 94
What is the median score?
79
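The three-step recipe for the median can be sketched in Python as follows. This is an illustrative sketch, not part of the original slides; the function and variable names (`median`, `data_25`, `data_24`, `test_scores`) are assumptions made for the example.

```python
def median(values):
    """Median following the slide's steps: sort, then take the middle."""
    ordered = sorted(values)              # step 1: sort smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        # step 2: n odd -> observation (n+1)/2 down the list
        # (subtract 1 because Python lists are 0-indexed)
        return ordered[(n + 1) // 2 - 1]
    # step 3: n even -> mean of the two center observations,
    # which sit at positions n/2 and n/2 + 1 in 1-based counting
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


# The 25 observations from the median slide, and the same list without 6.1.
data_25 = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
           3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]
data_24 = data_25[:-1]
test_scores = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]

print("n = 25:", median(data_25))
print("n = 24:", round(median(data_24), 2))
print("test scores:", median(test_scores))
```

Rounding is applied only when printing the n = 24 case, because averaging 3.3 and 3.4 in floating point may not give exactly 3.35.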
Comparing the Mean & Median


Test Scores: 65, 65, 70, 75, 78, 80, 83, 87, 91, 94
Let’s Use our TI-83 Calculators to Find the Mean &
Median!




Enter data into a list via Stat|Edit
Use Stat|Calc|1-Var Stats
What happens to the Mean and Median if the lowest
score was 20 instead of 65?
What happens to the Mean and Median if a low score of
20 is added to the data set (so we would now have 11
data points?)
What can we say about the Mean versus the Median?
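For readers without a TI-83 at hand, the two "what happens if" questions can be explored with Python's standard `statistics` module. This is a sketch under the assumption that "the lowest score" means one of the two 65s; the variable names are illustrative.

```python
from statistics import mean, median

scores = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]

# Scenario 1 (assumption: only one of the two 65s is changed to 20).
lowest_replaced = [20] + scores[1:]

# Scenario 2: a low score of 20 is added, giving 11 data points.
low_score_added = [20] + scores

for label, data in [
    ("original", scores),
    ("lowest score replaced by 20", lowest_replaced),
    ("score of 20 added", low_score_added),
]:
    print(f"{label:<28} n = {len(data):>2}  "
          f"mean = {mean(data):.1f}  median = {median(data)}")
```

In both scenarios the mean moves by several points while the median moves by at most one point, which is the behavior the next slide summarizes.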
Comparing the mean and the median
The mean and the median are similar when a distribution is
symmetric. The median is a measure of center that is resistant to skew
and outliers. The mean is not.
[Figure: Mean and median for a symmetric distribution, where the mean and median coincide.]
[Figure: Mean and median for skewed distributions, one panel with left skew and one with right skew, each marking the mean and the median.]
Mean and median of a distribution with outliers
[Figure: Percent of people dying, plotted without the outliers ($\bar{x} = 3.4$) and with the outliers ($\bar{x} = 4.2$).]
The mean is pulled to the right by the high outliers (from 3.4 to 4.2).
The median, on the other hand, is only slightly pulled to the right by the outliers (from 3.4 to 3.6).
Impact of skewed data
Mean and median of a symmetric distribution
Disease X: $\bar{x} = 3.4$, $M = 3.4$
Mean and median have similar values.
Mean and median of a right-skewed distribution
Multiple myeloma: $\bar{x} = 3.4$, $M = 2.5$
The mean is pulled toward the skew.
Measure of spread: quartiles (page 42)
The first quartile, Q1, is the value in the sample that has 25% of the data at or below it.
The third quartile, Q3, is the value in the sample that has 75% of the data at or below it.
Sorted data (n = 25):
0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4, 3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1
Q1 = first quartile = 2.2
M = median = 3.4
Q3 = third quartile = 4.35
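The quartile values above can be reproduced with a short Python sketch. It assumes the convention that Q1 is the median of the observations below the overall median and Q3 is the median of those above it, with the middle observation left out of both halves when n is odd. That convention matches the 2.2 and 4.35 shown here, but calculators and software packages may use other conventions and give slightly different quartiles. All names are illustrative.

```python
def median(values):
    """Middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Assumption: for odd n the overall median is excluded from both halves.
    """
    ordered = sorted(values)
    n = len(ordered)
    lower_half = ordered[: n // 2]          # observations below the median
    upper_half = ordered[(n + 1) // 2:]     # observations above the median
    return median(lower_half), median(upper_half)


data = [0.6, 1.2, 1.5, 1.6, 1.9, 2.1, 2.3, 2.3, 2.5, 2.8, 2.9, 3.3, 3.4,
        3.6, 3.7, 3.8, 3.9, 4.1, 4.2, 4.5, 4.7, 4.9, 5.3, 5.6, 6.1]

q1, q3 = quartiles(data)
print(f"Q1 = {q1:.2f}, M = {median(data):.2f}, Q3 = {q3:.2f}")
```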
The Five Number Summary (page 43)
The Boxplot (page 44)

A graphical representation of the five number summary.
Center and spread in boxplots
[Figure: boxplot of years until death (axis 0 to 7) for Disease X, a relatively symmetric data set, drawn beside the 25 sorted observations.]
Largest = max = 6.1
Q3 = third quartile = 4.35
M = median = 3.4
Q1 = first quartile = 2.2
Smallest = min = 0.6
Boxplots for skewed data
Boxplots remain true to the data and clearly depict symmetry or skewness.
Which boxplot is of the data in the top
histogram? In the bottom histogram?
[Figure: two histograms and two boxplots of years until death (axis 0 to 15), for Disease X and for multiple myeloma.]
IQR and outliers (page 46, 47)
The interquartile range (IQR) is the distance between the first and
third quartiles (the length of the box in the boxplot)
IQR = Q3 - Q1
An outlier is an individual value that falls outside the overall pattern.

How far outside the overall pattern does a value have to fall to be
considered an outlier?

Low outlier: any value < Q1 − 1.5 × IQR

High outlier: any value > Q3 + 1.5 × IQR
Let’s Find the Five Number Summary,
IQR, Box Plot, and where Outliers would
be for the Test Score Data:
65, 65, 70, 75, 78, 80, 83, 87, 91, 94
What do we notice about symmetry?
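The exercise above can be worked through in Python as a sketch. It assumes the same quartile convention as the earlier slides (Q1 and Q3 are the medians of the lower and upper halves of the sorted data) and uses the 1.5 × IQR rule for the outlier cutoffs. The function name `five_number_summary` and the variable names are illustrative.

```python
from statistics import median


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max).

    Assumption: Q1 and Q3 are the medians of the lower and upper halves,
    leaving out the middle observation when n is odd.
    """
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2:])
    return ordered[0], q1, median(ordered), q3, ordered[-1]


scores = [65, 65, 70, 75, 78, 80, 83, 87, 91, 94]
minimum, q1, med, q3, maximum = five_number_summary(scores)

iqr = q3 - q1
low_cutoff = q1 - 1.5 * iqr     # values below this would be low outliers
high_cutoff = q3 + 1.5 * iqr    # values above this would be high outliers
outliers = [x for x in scores if x < low_cutoff or x > high_cutoff]

print("five-number summary:", minimum, q1, med, q3, maximum)
print("IQR:", iqr)
print("outlier cutoffs:", low_cutoff, high_cutoff)
print("outliers:", outliers)
```

With these scores, no value falls outside the cutoffs, so the boxplot whiskers would extend to the minimum and the maximum.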
Measures of Spread: Standard
Deviation (page 48)

Other Measures of Spread


Data Range (Max – Min)
IQR (75% Quartile minus 25% Quartile, i.e. the range
of the middle 50% of data)
Standard Deviation (Variance)

Measures how the data deviates from the
mean….hmm…how can we do this?
Computing Variance and Std. Dev. by
Hand and Via the TI83:

Recall the Sample Test Score Data:
65, 65, 70, 75, 78, 80, 83, 87, 91, 94


Recall the Sample Mean (X bar) was 78.8
We want to measure how the data deviates from the mean
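[Figure: number line from 65 to 95 with the mean $\bar{x} = 78.8$ marked, and the distances 4.2 and -13.8 drawn from the mean to two of the scores.]
What does the number 4.2 measure? How about -13.8?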
Measure of spread: standard deviation (page 48)
The standard deviation is used to describe the variation around the mean.
1) First calculate the variance $s^2$.
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
2) Then take the square root to get the standard deviation $s$.
$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Calculations …
Women’s height (inches)
Mean = 63.4
Sum of squared deviations from mean = 85.2
Degrees freedom (df) = (n − 1) = 13

i      $x_i$   $\bar{x}$   $(x_i - \bar{x})$   $(x_i - \bar{x})^2$
1      59      63.4        −4.4                19.0
2      60      63.4        −3.4                11.3
3      61      63.4        −2.4                5.6
4      62      63.4        −1.4                1.8
5      62      63.4        −1.4                1.8
6      63      63.4        −0.4                0.1
7      63      63.4        −0.4                0.1
8      63      63.4        −0.4                0.1
9      64      63.4        0.6                 0.4
10     64      63.4        0.6                 0.4
11     65      63.4        1.6                 2.7
12     66      63.4        2.6                 7.0
13     67      63.4        3.6                 13.3
14     68      63.4        4.6                 21.6
Mean   63.4                Sum 0.0             Sum 85.2

$s^2$ = variance = 85.2/13 = 6.55 inches squared
s = standard deviation = √6.55 = 2.56 inches
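The table's arithmetic can be reproduced step by step in Python. This is a minimal sketch with illustrative variable names. It carries the unrounded mean through the calculation, which appears to be what the table does, since its squared deviations match the unrounded mean rather than 63.4.

```python
import math

# Heights (inches) of the 14 women in the table above.
heights = [59, 60, 61, 62, 62, 63, 63, 63, 64, 64, 65, 66, 67, 68]

n = len(heights)
x_bar = sum(heights) / n                      # sample mean

deviations = [x - x_bar for x in heights]     # (x_i - x_bar); these sum to 0
squared_deviations = [d ** 2 for d in deviations]

sum_sq = sum(squared_deviations)              # sum of squared deviations
variance = sum_sq / (n - 1)                   # divide by degrees of freedom, n - 1
std_dev = math.sqrt(variance)                 # standard deviation s

print(f"mean = {x_bar:.1f}")
print(f"sum of squared deviations = {sum_sq:.1f}")
print(f"variance = {variance:.2f} inches squared")
print(f"standard deviation = {std_dev:.2f} inches")
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they can serve as a cross-check.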
We’ll never calculate these by hand, so make sure you know how
to get the standard deviation using your calculator.
Standard Deviation




On the next slide are histograms of quiz scores (from 1 to 10) for
the same class but taught by different professors.
Sort the classes from largest to smallest based on the mean quiz
score.
Sort the classes from largest to smallest based on the standard
deviation of the quiz score.
Which professor would you want to have for this class?
Quiz Scores (Same Class, Different Professors)
[Figure: six histograms, one each for Professor A, B, C, D, E, and F's class, showing % of students (0 to 45) versus quiz score.]
Sorted by Mean
             Prof A   Prof B   Prof C   Prof D   Prof E   Prof F
Mean =       5        5        5        5        5        7
Std Dev =    2.04     3.05     2.63     3.33     3.84     1.28

Sorted by Standard Deviation
             Prof F   Prof A   Prof C   Prof B   Prof D   Prof E
Mean =       7        5        5        5        5        5
Std Dev =    1.28     2.04     2.63     3.05     3.33     3.84
Which professor would you want to take for this class?
Software output for summary statistics:
Excel: from the menu, choose Tools/Data Analysis/Descriptive Statistics.
This gives common statistics of your sample data.
Minitab
Choosing among summary statistics

Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetric and don’t have outliers.
Plot the mean and use the standard deviation for error bars.
Otherwise, use the median in the five-number summary, which can be plotted as a boxplot.
[Figure: Height of 30 women, height in inches (58 to 69), shown two ways: as a boxplot and as mean ± s.d.]
What should you use? When and why?
Arithmetic mean or median?

Middletown is considering imposing an income tax on citizens. City hall
wants a numerical summary of its citizens’ incomes to estimate the total tax
base.


Mean: Although income is likely to be right-skewed, the city government
wants to know about the total tax base.
In a study of standard of living of typical families in Middletown, a sociologist
makes a numerical summary of family income in that city.

Median: The sociologist is interested in a “typical” family and wants to
lessen the impact of extreme incomes.
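A small numerical sketch may help show why the two users want different summaries. The incomes below are hypothetical, made-up numbers chosen only to be right-skewed; they are not from the slides, and all names are illustrative.

```python
from statistics import mean, median

# Hypothetical incomes (dollars) for a tiny right-skewed "city" of 10 families.
incomes = [22_000, 28_000, 31_000, 35_000, 40_000,
           46_000, 52_000, 60_000, 75_000, 411_000]

n = len(incomes)
total_income = sum(incomes)

# City hall: the mean times the number of families recovers the total tax base.
mean_times_n = mean(incomes) * n

# Sociologist: the median describes a "typical" family,
# but scaling it up does not recover the total.
median_times_n = median(incomes) * n

print("total income:     ", total_income)
print("mean x n:         ", mean_times_n)
print("median x n:       ", median_times_n)
print("mean, median:     ", mean(incomes), median(incomes))
```

In this made-up data the single very high income pulls the mean well above the median, so the median is closer to what most families earn, while only the mean scales back up to the total.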
Organizing a statistical problem (page 53)

State: What is the practical question, in the context of a real-world
setting?

Formulate: What specific statistical operations does this problem call
for?

Solve: Make the graphs and carry out the calculations needed for this
problem.

Conclude: Give your practical conclusion in the setting of the real-world
problem.