Measures of Central Tendency
and Spread
Chapter 1, Section 2
The Motivation
• Measures of central tendency are used to
describe the typical member of a
population.
• Depending on the type of data, typical could
have a variety of “best” meanings.
• We will discuss three of these possible
choices.
3 Measures of Central Tendency
• Mean – the arithmetic average. This is used for continuous
data.
• Median – a value that splits the data into two halves, that
is, one half of the data is smaller than that number, the
other half larger. May be used for continuous or ordinal
data.
• Mode – this is the category that has the most data. As the
description implies it is used for categorical data.
Mean
• To find the mean, add all of the values, then divide by the number of values.
• The lower case Greek letter mu is used for population mean.
Population: $\mu = \frac{\sum x}{N}$
• An “x” with a bar over it, read x-bar, is used for sample mean.
Sample: $\bar{x} = \frac{\sum x}{n}$
Mean Example
listing   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
X        14  17  31  28  42  43  51  51  66  70  67  70  78  62  47
n = 15, total = 737
x-bar = 737/15 = 49.13333
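As a quick check of the arithmetic above, here is a minimal Python sketch that adds the fifteen values and divides by their count. The names `values`, `mean`, and `x_bar` are illustrative choices, not part of the slides.

```python
# Values from the Mean Example slide.
values = [14, 17, 31, 28, 42, 43, 51, 51, 66, 70, 67, 70, 78, 62, 47]


def mean(xs):
    """Add all of the values, then divide by the number of values."""
    return sum(xs) / len(xs)


x_bar = mean(values)
print(len(values), sum(values), round(x_bar, 5))  # 15 737 49.13333
```

Python's standard library offers `statistics.mean`, which should give the same result for this list.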
Median
• The median is a number chosen so that half of the
values in the data set are smaller than that number,
and the other half are larger.
• To find the median
– List the numbers in ascending order
– If there is a number in the middle (odd number of
values) that is the median
– If there is not a middle number (even number of values)
take the two in the middle, their average is the median
Median Example
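Odd number of values (15):
listing   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
X        14  17  28  31  42  43  47  51  51  62  66  67  70  70  78
The middle (8th) value, 51, is the median.

Even number of values (16):
listing   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16
X        14  17  28  31  42  43  47  51  53  57  62  66  67  70  70  78
The two middle values are 51 and 53, so the median is (51 + 53)/2 = 52.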
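The two cases of the median rule (odd and even number of values) can be sketched in Python as follows. The function name `median` and the list names are illustrative; the data are the two lists from the example.

```python
def median(xs):
    """Middle value of the sorted data, or the average of the two middle values."""
    ordered = sorted(xs)          # list the numbers in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                # odd count: there is a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two in the middle


odd_set = [14, 17, 28, 31, 42, 43, 47, 51, 51, 62, 66, 67, 70, 70, 78]
even_set = [14, 17, 28, 31, 42, 43, 47, 51, 53, 57, 62, 66, 67, 70, 70, 78]

print(median(odd_set))   # 51
print(median(even_set))  # 52.0
```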
Mode
• The mode is simply the category or value which
occurs the most in a data set.
• If a category has radically more than the others, it
is a mode.
• Generally speaking we do not consider more than
two modes in a data set.
• No clear guideline exists for deciding how many
more entries a category must have than the others
to constitute a mode.
Obvious Example
• There is obviously more yellow than red or blue.
• Yellow is the mode.
• The mode is the class, not the frequency.
[Bar chart: Beach Ball Production, in thousands (0 to 80), by color: blue, red, yellow]
Bimodal
[Bar chart: Geometry Scores For TASP, frequency (0 to 120) by category: very bad, bad, neutral, good, very good]
No Mode
Category   Frequency
1          51
2          51
3          66
4          62
5          65
6          57
7          47
8          43
9          64
• Although the third category is the largest, it is not sufficiently different to be called the mode.
[Bar chart: frequency (0 to 70) by category, 1 to 9]
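To see how small the gap is, the short Python sketch below ranks the categories by frequency and compares the largest with the runner-up. The variable names are illustrative, and no cut-off for "sufficiently different" is assumed, since the slides say no clear guideline exists.

```python
# Category -> frequency, taken from the No Mode table.
frequencies = {1: 51, 2: 51, 3: 66, 4: 62, 5: 65, 6: 57, 7: 47, 8: 43, 9: 64}

# Rank categories from most to least frequent.
ranked = sorted(frequencies.items(), key=lambda item: item[1], reverse=True)
(top_cat, top_freq), (next_cat, next_freq) = ranked[0], ranked[1]

lead = top_freq - next_freq
print(top_cat, top_freq, next_cat, next_freq, lead)  # 3 66 5 65 1
```

Category 3 leads category 5 by a single entry, which is consistent with the slide's judgment that no category stands out as a mode.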
Quartiles, Percentiles and Other
Fractiles
• We will only consider the quartile, but the same
concept is often extended to percentages or other
fractions.
• The median is a good starting point for finding the
quartiles.
• Recall that to find the median, we wanted to locate
a point so that half of the data was smaller, and the
other half larger than that point.
Quartile
• For quartiles, we want to divide our data
into 4 equal pieces.
Suppose we had the following data set (already in order)
2 3 7 8 8 8 9 13 17 20 21 21
Choosing the numbers 7.5, 8.5, and 18.5 as markers would
divide the data into 4 groups, each with three elements.
These numbers would be the three quartiles for this data set.
Quartiles Continued
• Conceptually, this is easy, simply find the median, then
treat the left hand side as if it were a data set, and find its
median; then do the same to the right hand side.
• This is not always simple. Consider the following data set.
• 3 3 3 3 3 5 6 8 8 8 8 8 9
• The first difficulty is that the data set does not divide
nicely.
• Using the rules for finding a median, we would get
quartiles of 3, 6 and 8.
• The second difficulty is how many of the 3’s are in the first
quartile, and how many in the second?
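The "median of each half" idea can be sketched in Python as below. One assumption is made loudly here: when the number of values is odd, the median itself is left out of both halves. Other conventions exist and can give different quartiles; the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q1, median, Q3) using the median-of-halves approach.

    Assumption: for an odd count, the middle value belongs to neither half.
    """
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


print(quartiles([2, 3, 7, 8, 8, 8, 9, 13, 17, 20, 21, 21]))  # (7.5, 8.5, 18.5)
print(quartiles([3, 3, 3, 3, 3, 5, 6, 8, 8, 8, 8, 8, 9]))    # (3.0, 6, 8.0)
```

Both results agree with the quartiles quoted on the slides (7.5, 8.5, 18.5 and 3, 6, 8).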
Quartiles Continued
• For this course, let’s pretend that this is not
an issue.
• I will give you the quartiles.
• I will not ask how many are in a quartile.
Stem and Leaf Plot
• Take a data set:
• Put data in order from least to greatest
• Make a list of the stems, inclusive of all
from least to greatest
• Fill in the leaves
• Make a key
Stem and Leaf
• Data: 11, 13, 16, 16, 16, 19, 20, 41, 43
• Stems: 1, 2, 3, and 4, even though no data in the 30s.
• Compare data: 13, 15, 22, 23, 25, 27, 31, 33, 34, 46

Compare data | Stem | Data
         5 3 |  1   | 1 3 6 6 6 9
     7 5 3 2 |  2   | 0
       4 3 1 |  3   |
           6 |  4   | 1 3
Key: 1/3 = 13
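The steps for building a one-sided stem and leaf plot can be sketched in Python as follows. This is a minimal illustration that assumes non-negative whole numbers with the last digit as the leaf; the function name `stem_and_leaf` is hypothetical.

```python
def stem_and_leaf(data):
    """Return the lines of a stem and leaf plot (stem = tens, leaf = last digit)."""
    leaves = {}
    for value in sorted(data):                 # put data in order
        stem, leaf = divmod(value, 10)
        leaves.setdefault(stem, []).append(leaf)
    lines = []
    # List every stem from least to greatest, including stems with no data.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines


for line in stem_and_leaf([11, 13, 16, 16, 16, 19, 20, 41, 43]):
    print(line)
# 1 | 1 3 6 6 6 9
# 2 | 0
# 3 |
# 4 | 1 3
```

Note that stem 3 still appears even though there is no data in the 30s.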
You try:
• Data set 1: 3, 5, 18, 19, 19, 20, 30, 56, 57, 58, 58
• Data set 2: 17, 17, 18, 18, 18, 19, 29, 29, 29, 59
• Write two things you notice from this stem and leaf plot.
Answer
 Data set 2 | Stem | Data set 1
            |  0   | 3 5
9 8 8 8 7 7 |  1   | 8 9 9
      9 9 9 |  2   | 0
            |  3   | 0
            |  4   |
          9 |  5   | 6 7 8 8
Key: 0/3 = 3
Note about stems and leaves
• Given data: 465, 466, 470, 489, …
46 | 5 6
47 | 0
48 | 9
Key: 46/5 = 465
• Given data: 0.95, 0.99, 0.89, 1.03, 1.09
08 | 9
09 | 5 9
10 | 3 9
Key: 10/3 = 1.03
Box and Whisker Plot
• Put data in order.
• Make a number line containing minimum
and maximum points
• Mark min and max with dots.
• Mark the median with a line.
• Mark the quartiles with a line.
• Make a box, and connect the whiskers.
Box and Whisker Plot
3, 5, 18, 19, 19, 20, 30, 56, 57, 58, 58
17, 17, 18, 18, 18, 19, 29, 29, 29, 59
[Box and whisker plots of the two data sets, drawn above a number line running from 0 to 70]
Different Distributions
• Consider the range of the data (the minimum point
to the maximum point).
• If there is no mode, then the distribution is
relatively uniform.
• If the mean, median, and mode are about equal,
then the distribution is roughly “normal”.
• If the mean and median are not roughly equal,
then the distribution is “skewed”.
Standard Deviation
The Standard Deviation is a number that measures how far the numbers in a set of data typically are from their mean.
If the Standard Deviation is large, it means the numbers are spread out from their mean.
If the Standard Deviation is small, it means the numbers are close to their mean.
Two classes took a
recent quiz. There were
10 students in each
class, and each class had
an average score of 81.5
Since the averages are the
same, can we assume that
the students in both classes
all did pretty much the
same on the exam?
The answer is… No.
The average (mean) does
not tell us anything about
the distribution or variation
in the grades.
Here are Dot-Plots of the grades
in each class:
[Dot plots of the grades in each class, with the mean marked]
So, we need to come up
with some way of
measuring not just the
average, but also the
spread of the distribution
of our data.
Why not just give an
average and the range of
data (the highest and
lowest values) to describe
the distribution of the
data?
Well, for example, let's say that for
a set of data, the average is 17.95
and the range is 23.
But what if the data looked like
this:
Here is the average
And here is the range
But really, most of
the numbers are in
this area, and are not
evenly distributed
throughout the
range.
Here are the scores on the math quiz for Team A:
72, 76, 80, 80, 81, 83, 84, 85, 85, 89
Average: 81.5
The Standard Deviation measures how far away each number
in a set of data is from their mean.
For example, start with the lowest score, 72. How far away is
72 from the mean of 81.5?
72 - 81.5 = -9.5
Or, start with the highest score, 89. How far away is 89 from
the mean of 81.5?
89 - 81.5 = 7.5
So, the first step to finding the Standard Deviation is to find all the distances from the mean.
Next, you need to square each of the distances to turn them all into positive numbers.

Score   Distance from Mean   Distance Squared
72      -9.5                 90.25
76      -5.5                 30.25
80      -1.5                 2.25
80      -1.5                 2.25
81      -0.5                 0.25
83      1.5                  2.25
84      2.5                  6.25
85      3.5                  12.25
85      3.5                  12.25
89      7.5                  56.25

Add up all of the squared distances.
Sum: 214.5
Divide by (n - 1), where n represents the number of values you have.
214.5 / (10 - 1) = 23.8
Finally, take the square root of that average squared distance.
$\sqrt{23.8} = 4.88$
This is the Standard Deviation.
Now find the Standard Deviation for the other class grades.

Score   Distance from Mean   Distance Squared
57      -24.5                600.25
65      -16.5                272.25
83      1.5                  2.25
94      12.5                 156.25
95      13.5                 182.25
96      14.5                 210.25
98      16.5                 272.25
93      11.5                 132.25
71      -10.5                110.25
63      -18.5                342.25

Sum: 2280.5
2280.5 / (10 - 1) = 253.4
$\sqrt{253.4} = 15.92$
Now, let's compare the two classes again

                       Team A   Team B
Average on the Quiz    81.5     81.5
Standard Deviation     4.88     15.92
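The whole procedure (distances from the mean, squares, sum, divide by n - 1, square root) can be sketched in a few lines of Python. The function name `sample_std_dev` is illustrative; the scores are the two classes from the slides.

```python
import math


def sample_std_dev(scores):
    """Standard deviation with the (n - 1) divisor, following the steps above."""
    mean = sum(scores) / len(scores)
    distances = [x - mean for x in scores]        # distance of each score from the mean
    squared = [d ** 2 for d in distances]         # squaring makes every term non-negative
    variance = sum(squared) / (len(scores) - 1)   # divide the sum by (n - 1)
    return math.sqrt(variance)                    # square root gives the standard deviation


team_a = [72, 76, 80, 80, 81, 83, 84, 85, 85, 89]
team_b = [57, 65, 83, 94, 95, 96, 98, 93, 71, 63]

print(round(sample_std_dev(team_a), 2))  # 4.88
print(round(sample_std_dev(team_b), 2))  # 15.92 (15.918... before rounding)
```

The standard library function `statistics.stdev` uses the same (n - 1) divisor and should agree with these values.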
5 Number Summary
• The five number summary is the minimum value, the three quartiles and
the maximum value.
• This may be represented graphically with a box and whisker plot.
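As an illustration, the sketch below computes the five number summary for the two data sets used on the box and whisker slide. It assumes the median-of-halves convention for quartiles (the middle value is left out of both halves when the count is odd); other conventions can give slightly different quartiles. Function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def five_number_summary(xs):
    """Return (minimum, Q1, median, Q3, maximum)."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]


set_1 = [3, 5, 18, 19, 19, 20, 30, 56, 57, 58, 58]
set_2 = [17, 17, 18, 18, 18, 19, 29, 29, 29, 59]

print(five_number_summary(set_1))  # (3, 18, 20, 57, 58)
print(five_number_summary(set_2))  # (17, 18, 18.5, 29, 59)
```

These five numbers are the points a box and whisker plot marks: the two ends of the whiskers, the two edges of the box, and the line inside it.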
Outliers
• Outliers are values in the data set which are either
suspiciously large or small.
• Such values may be the result of an error: the
researcher may have measured incorrectly, or the
results may have been typed incorrectly.
• Outliers may be good data. There is always the
chance that you have one basketball player in a set
of ordinary people.
• The seven foot height is not an error, but it is still
unusually large.
Interquartile Range
• One method for identifying these outliers
involves the use of quartiles.
• The interquartile range (IQR) is Q3 – Q1.
• All numbers less than Q1 – 1.5(IQR) are
probably too small.
• All numbers greater than Q3 + 1.5(IQR) are
probably too large.
Using IQR to Find Outliers
The red lines are 1.5 times the IQR. Starting from Q1 going
left, and starting from Q3 going right 1.5(IQR) we establish
limits. All numbers smaller on the left, and larger on the right
are outliers.
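The 1.5(IQR) rule can be sketched in Python as follows, here applied to the second data set from the box and whisker slide. The quartiles are found with the median-of-halves convention, which is an assumption; the function names are illustrative.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(xs):
    """Return (Q1, median, Q3); the middle value is excluded from both halves if n is odd."""
    ordered = sorted(xs)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + 1:] if n % 2 else ordered[half:]
    return median(lower), median(ordered), median(upper)


def iqr_outliers(xs):
    """Return the lower limit, the upper limit, and the values outside them."""
    q1, _, q3 = quartiles(xs)
    iqr = q3 - q1
    low = q1 - 1.5 * iqr     # anything smaller is probably too small
    high = q3 + 1.5 * iqr    # anything larger is probably too large
    return low, high, [x for x in xs if x < low or x > high]


data = [17, 17, 18, 18, 18, 19, 29, 29, 29, 59]
print(iqr_outliers(data))  # (1.5, 45.5, [59])
```

Here Q1 = 18 and Q3 = 29, so the IQR is 11 and the limits are 1.5 and 45.5; only 59 falls outside them.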
Linear Transformations
• When changing units, e.g., feet to meters,
degrees F to degrees C, we employ a linear
transformation.
– New = a + b × Old
• Measures of both center and spread will be
multiplied by “b” (for spread, by the absolute value of “b”, since spread cannot be negative).
• Only measures of center (location) are affected by
“a”.
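A small Python sketch can confirm these two rules for the degrees F to degrees C conversion, where a = -160/9 and b = 5/9. The temperature readings are made-up illustrative values, not data from the slides, and the function name `transform` is hypothetical.

```python
import math
import statistics


def transform(values, a, b):
    """Apply New = a + b * Old to every value."""
    return [a + b * x for x in values]


# Hypothetical readings in degrees F (illustrative only).
temps_f = [32.0, 50.0, 68.0, 86.0, 104.0]

# Degrees C = (F - 32) * 5/9, which is a + b * F with these constants.
a, b = -160 / 9, 5 / 9
temps_c = transform(temps_f, a, b)

mean_f, mean_c = statistics.mean(temps_f), statistics.mean(temps_c)
sd_f, sd_c = statistics.stdev(temps_f), statistics.stdev(temps_c)

# The center is shifted by a and scaled by b.
assert math.isclose(mean_c, a + b * mean_f)
# The spread is scaled by |b| and is not affected by a.
assert math.isclose(sd_c, abs(b) * sd_f)
```

Both assertions pass for this sample, matching the rule that "a" moves only the center while "b" rescales both center and spread.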