Measures of Center
SUPPOSE THAT AN INSTRUCTOR IS TEACHING TWO SECTIONS OF A
COURSE AND THAT SHE CALCULATES THE MEAN EXAM SCORE TO BE
60 FOR SECTION 1 AND 90 FOR SECTION 2
A. Do you have enough information to determine the mean exam score
for the two sections combined? Explain
B. What can you say with certainty about the value of the overall mean
for the two sections combined?
C. Without seeing all of the individual students’ exam scores, what
information would you need to be able to calculate the overall mean?
D. Suppose that section 1 contains 20 students and section 2 contains 30
students. Calculate the overall mean exam score. Is the overall mean
closer to 60 or 90?
E. Give an example of sample sizes for the two sections for which the
overall mean turns out to be less than 65.
F. If you do not know the number of students in the sections but do
know that there is the same number of students in the 2 sections, can
you determine the overall mean?
G. Explain how it could happen that a student could transfer from section
1 to section 2 and cause the mean score for each section to decrease.
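Part D can be checked with a short calculation: the overall mean is the section means weighted by the section sizes. A minimal Python sketch follows; the function name `combined_mean` is illustrative and not from the slides.

```python
def combined_mean(means, sizes):
    """Overall mean of several groups, given each group's mean and size."""
    total = sum(m * n for m, n in zip(means, sizes))  # sum of all scores
    return total / sum(sizes)

# Part D: section 1 has 20 students (mean 60), section 2 has 30 (mean 90)
overall = combined_mean([60, 90], [20, 30])
print(overall)  # 78.0, closer to 90 because section 2 is larger
```

With equal section sizes the same function returns the simple average of 60 and 90, which is the situation described in part F.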
Measures of Central Tendency
• Median - the middle of the data; the 50th percentile
–Observations must be in numerical order
–Find the single middle value if n is odd
–Take the average of the middle two values if n is even
NOTE: n denotes the sample size
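The median rule can be sketched in a few lines of Python. This is an illustrative implementation of the odd/even rule stated above, run on two small lists chosen for the example.

```python
def median(values):
    """Middle value if n is odd; average of the middle two values if n is even."""
    ordered = sorted(values)   # observations must be in numerical order
    n = len(ordered)           # n denotes the sample size
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([8, 2, 12, 4, 3]))     # 4
print(median([2, 3, 4, 6, 8, 20]))  # 5.0
```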
Finding the Center: The Median
• The median is the value with exactly half the
data values below it and half above it.
– It is the middle data value (once the data values have been ordered) that divides the histogram into two equal areas.
– It has the same units as the data.
Measures of Central Tendency
• Mean - the arithmetic average
–Use μ to represent a population mean (a parameter)
–Use x̄ to represent a sample mean (a statistic)
$\bar{x} = \frac{\sum x}{n}$
Mean
• Regardless of the shape of the distribution, the mean is the point at which a histogram of the data would balance; the median is the equal area point.
Measures of Central Tendency
• Mode – the observation that occurs the most often
–Can be more than one mode
–If all values occur only once – there is no mode
–Not used as often as mean & median
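The three mode rules above (most frequent value, possibly several modes, no mode when every value occurs once) can be sketched as follows. The helper name `modes` is an assumption made for this example.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or an empty list if there is no mode."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:               # every value occurs only once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 1, 2, 2, 3]))    # [1, 2]  (more than one mode)
print(modes([2, 3, 4, 8, 12]))   # []      (no mode)
```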
Another Measure of Center
• As a measure of center, the midrange may also
be used (the average of the minimum and
maximum values). However it is very sensitive
to skewed distributions and outliers.
• The median is a more reasonable choice for
center than the midrange in skewed
distributions.
Using the calculator . . .
Enter the data in a list
Go to LIST Menu
Highlight MATH
Find your function
OR
Go to STAT Menu
Highlight CALC
Run 1-Var Stats on your list
Describing Quantitative Data
• Measuring Center
Example, page 53
– Use the data below to calculate the mean and median of the commuting times (in minutes) of 20 randomly selected New York workers.

10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45

$\bar{x} = \frac{10 + 30 + 5 + 25 + \dots + 40 + 45}{20} = 31.25$ minutes

Stemplot of the travel times:
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
Key: 4|5 represents a New York worker who reported a 45-minute travel time to work.

$M = \frac{20 + 25}{2} = 22.5$ minutes
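As a check on the example, the same mean and median can be reproduced with Python's standard `statistics` module. This is a sketch for verification, not part of the original slides.

```python
from statistics import mean, median

# Commuting times (minutes) of the 20 New York workers
travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

print(mean(travel_times))    # 31.25
print(median(travel_times))  # 22.5
```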
Suppose we are interested in the number of lollipops that are bought at a certain store. A sample of 5 customers buys the following number of lollipops. Find the median.
2 3 4 8 12
The numbers are in order & n is odd – so find the middle observation.
The median is 4 lollipops!
What would happen to the median & mean if the 12 lollipops were 20?
2 3 4 6 8 20
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 20}{6} = 7.17$
What happened?
What would happen to the median & mean if the 20 lollipops were 50?
2 3 4 6 8 50
The median is . . . 5
The mean is . . . $\frac{2 + 3 + 4 + 6 + 8 + 50}{6} = 12.17$
What happened?
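The effect of changing the largest value can be reproduced directly. The sketch below uses the six-value lollipop data with the largest value set to 20 and then 50.

```python
from statistics import mean, median

results = {}
for largest in (20, 50):
    lollipops = [2, 3, 4, 6, 8, largest]
    results[largest] = (median(lollipops), round(mean(lollipops), 2))
    print(largest, results[largest])
# 20 (5.0, 7.17)
# 50 (5.0, 12.17)
```

The median stays at 5 while the mean follows the extreme value.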
Resistant
• Statistics that are not affected by extreme values (outliers)
• Is the median resistant? YES
► Is the mean resistant? NO
Look at the following data set. Find the mean & median.
21 27 23 23 24 25 25 27 27 28
30 30 26 26 26 27 30 31 32 32
Create a histogram with the data (use an x-scale of 2). Then find the mean and median.
Mean = 27
Median = 27
Look at the placement of the mean and median in this symmetrical distribution.
Look at the following data set. Find the mean & median.
22 23 29 28 22 24 24 23 26
36 25 28 21 38 62 23 25
Mean = 28.176
Median = 25
Look at the placement of the mean and median in this right skewed distribution.
Look at the following data set. Find the mean & median.
21 46 54 47 53 60 55 55 56
63 64 58 58 58 58 62 60
Create a histogram with the data. Then find the mean and median.
Mean = 54.588
Median = 58
Look at the placement of the mean and median in this skewed left distribution.
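The three data sets can be compared side by side. The sketch below recomputes each mean and median so the direction of the gap between them can be matched to the shape of the distribution; the variable names are chosen for this example.

```python
from statistics import mean, median

symmetric = [21, 27, 23, 23, 24, 25, 25, 27, 27, 28,
             30, 30, 26, 26, 26, 27, 30, 31, 32, 32]
right_skewed = [22, 23, 29, 28, 22, 24, 24, 23, 26,
                36, 25, 28, 21, 38, 62, 23, 25]
left_skewed = [21, 46, 54, 47, 53, 60, 55, 55, 56,
               63, 64, 58, 58, 58, 58, 62, 60]

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name}: mean {mean(data):.3f}, median {median(data):.1f}")
# symmetric: mean 27.000, median 27.0
# right skewed: mean 28.176, median 25.0
# left skewed: mean 54.588, median 58.0
```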
Comparing the mean and the median
In a symmetric distribution the mean and the median are the same. Even in a skewed distribution, the median remains at the center point; the mean, however, is pulled in the direction of the skew.
[Figure: positions of the mean and median for a symmetric distribution, a left-skewed distribution, and a right-skewed distribution]
WHICH MEASURE OF CENTER?
► Given that the MEAN is a NON-RESISTANT measure, it makes sense to use the MEDIAN in a skewed distribution as the “more typical” value
► Ex. Consider the following test scores:
► 96 98 92 90 95 100 91 55
► Find the mean & the median
► Which one is more “typical”?
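For the test-score example, a quick computation shows how far the single low score pulls the mean. This is a verification sketch using the scores listed above.

```python
from statistics import mean, median

scores = [96, 98, 92, 90, 95, 100, 91, 55]

print(mean(scores))    # 89.625
print(median(scores))  # 93.5
```

Seven of the eight scores are 90 or higher, so the median of 93.5 describes a typical score better than the mean of 89.625.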
Trimmed mean:
To calculate a trimmed mean:
• Multiply the % to trim by n
• Truncate that many observations from
BOTH ends of the distribution (when
listed in order)
• Calculate the mean with the shortened
data set
First find the mean of the data then find a 10% trimmed mean with the following data.
12 14 19 20 22 24 25 26 26 55
10%(10) = 1
So remove one observation from each side!
$\frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22$
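The trimming steps can be sketched as a small function. The name `trimmed_mean` and the choice to pass the percentage as a whole number are assumptions made for this example.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent` % of the observations from BOTH ends."""
    ordered = sorted(values)
    n = len(ordered)
    k = int(percent * n // 100)   # number of observations to drop per side
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)

data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 55]
print(sum(data) / len(data))    # 24.3  (ordinary mean)
print(trimmed_mean(data, 10))   # 22.0  (12 and 55 removed)
```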
WEIGHTED MEAN
• Find your semester average if the Midterm is weighted 25%, the Paper 25% & the Final 50%
• Midterm --- 92
• Paper ---- 80
• Final --- 88
.25(92) + .25(80) + .5(88) = 87
WEIGHTED MEAN
Weighted Mean is an average computed
by giving different weights to some of the
individual values. If all the weights are
equal, then the weighted mean is the same
as the arithmetic mean.
$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$
x is each data value
w is the number of occurrences of x (weight)
x̄ is the weighted mean
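A weighted mean divides the weighted sum of the values by the sum of the weights. The sketch below applies that to the semester-average example; the function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of w*x divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Midterm 92 (25%), Paper 80 (25%), Final 88 (50%)
semester = weighted_mean([92, 80, 88], [0.25, 0.25, 0.5])
print(semester)  # 87.0

# With equal weights the result is the ordinary arithmetic mean
equal = weighted_mean([92, 80, 88], [1, 1, 1])
```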
CONSIDER THE FOLLOWING 3 SAMPLE DATA SETS:
I:   20 40 50 30 60 70
II:  47 43 44 46 20 70
III: 44 43 40 50 46 47
COMPUTE THE RANGE, MEDIAN & MEAN FOR EACH DATA SET
WHAT DO YOU NOTICE???
NOW TAKE A LOOK AT COMPARING THE DOT PLOTS
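The three summaries can be tabulated with a short loop. This sketch only automates the arithmetic asked for above; the dictionary name `data_sets` is an assumption.

```python
from statistics import mean, median

data_sets = {
    "I": [20, 40, 50, 30, 60, 70],
    "II": [47, 43, 44, 46, 20, 70],
    "III": [44, 43, 40, 50, 46, 47],
}

summary = {}
for name, data in data_sets.items():
    summary[name] = (max(data) - min(data), median(data), mean(data))
    print(name, "range:", summary[name][0],
          "median:", summary[name][1], "mean:", summary[name][2])
```

All three sets share the same median and mean, and sets I and II even share the same range, yet the values are spread very differently, which is why the dot plots are worth comparing.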
Why is the study of variability
important?
• Allows us to distinguish between usual &
unusual values
• In some situations, want more/less
variability
• When describing data, never rely on
center alone
• Like Measures of Center, you must
choose the most appropriate measure of
spread.
Measures of Variability
• range (max-min)
• interquartile range (Q3-Q1)
• deviations $(x - \bar{x})$
• variance $\sigma^2$
• standard deviation $\sigma$ (lower case Greek letter sigma)
A measure of center alone can be misleading.
A useful numerical description of a distribution requires both a
measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile Range
To calculate the quartiles:
1)Arrange the observations in increasing order and locate the
median M.
2)The first quartile Q1 is the median of the observations
located to the left of the median in the ordered list.
3)The third quartile Q3 is the median of the observations
located to the right of the median in the ordered list.
The interquartile range (IQR) is defined as:
IQR = Q3 – Q1
Measuring Spread: The Interquartile Range (IQR)
Find and Interpret the IQR
Travel times to work for 20 randomly selected New Yorkers
10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20
25 30 30 40 40 45 60 60 65 85
Q1 = 15
M = 22.5
Q3 = 42.5
IQR = Q3 – Q1
= 42.5 – 15
= 27.5 minutes
Interpretation: The range of the middle half of travel times for the
New Yorkers in the sample is 27.5 minutes.
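The quartile procedure can be sketched as follows. This follows the rule stated earlier (Q1 and Q3 are the medians of the observations to the left and right of the median); other software, including `statistics.quantiles`, may use a different convention and give slightly different quartiles. The function name `quartiles` is an assumption.

```python
from statistics import median

def quartiles(values):
    """Return (Q1, M, Q3) as medians of the lower half, all data, upper half."""
    ordered = sorted(values)
    n = len(ordered)
    lower = ordered[: n // 2]         # observations left of the median
    upper = ordered[(n + 1) // 2 :]   # observations right of the median
    return median(lower), median(ordered), median(upper)

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
q1, m, q3 = quartiles(travel_times)
iqr = q3 - q1
print(q1, m, q3, iqr)  # 15.0 22.5 42.5 27.5
```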
Identifying Outliers
In addition to serving as a measure of spread, the interquartile range (IQR) is used as part of a rule of thumb for identifying outliers.
Definition:
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR above the third quartile or below the first quartile.
Example, page 57
In the New York travel time data, we found Q1=15 minutes, Q3=42.5 minutes, and IQR=27.5 minutes.
For these data, 1.5 x IQR = 1.5(27.5) = 41.25
Q1 - 1.5 x IQR = 15 – 41.25 = -26.25
Q3 + 1.5 x IQR = 42.5 + 41.25 = 83.75
Any travel time shorter than -26.25 minutes or longer than 83.75 minutes is considered an outlier. In the stemplot, only the 85-minute travel time lies beyond these limits.
0 | 5
1 | 005555
2 | 0005
3 | 00
4 | 005
5 |
6 | 005
7 |
8 | 5
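The 1.5 x IQR rule is easy to apply in code. The sketch below takes Q1 and Q3 from the example as given values and flags any travel time outside the two cutoffs; the names `low_fence` and `high_fence` are illustrative.

```python
travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

q1, q3 = 15, 42.5                 # quartiles found in the example
iqr = q3 - q1                     # 27.5
low_fence = q1 - 1.5 * iqr        # -26.25
high_fence = q3 + 1.5 * iqr       # 83.75

outliers = [t for t in travel_times if t < low_fence or t > high_fence]
print(low_fence, high_fence, outliers)  # -26.25 83.75 [85]
```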
Five-Number Summary

The minimum and maximum values alone tell us little about
the distribution as a whole. Likewise, the median and quartiles
tell us little about the tails of a distribution.

To get a quick summary of both center and spread, combine
all five numbers.
Definition:
The five-number summary of a distribution consists of the
smallest observation, the first quartile, the median, the third
quartile, and the largest observation, written in order from
smallest to largest.
Minimum   Q1   M   Q3   Maximum
Boxplots (Box-and-Whisker Plots)
The five-number summary divides the distribution roughly into quarters. This leads to a new way to display quantitative data, the boxplot.
How to Make a Boxplot
•Draw and label a number line that includes the range of the distribution.
•Draw a central box from Q1 to Q3.
•Note the median M inside the box.
•Extend lines (whiskers) from the box out to the minimum and maximum values that are not outliers.
Construct a Boxplot
Consider our NY travel times data. Construct a boxplot.

10 30 5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
In increasing order:
5 10 10 15 15 15 15 20 20 20
25 30 30 40 40 45 60 60 65 85
Min=5
Q1 = 15
M = 22.5
Q3 = 42.5
Max=85
Recall, the maximum of 85 is an outlier by the 1.5 x IQR rule
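The numbers needed to draw the boxplot can be collected in one place. The sketch below computes the five-number summary and the whisker ends, meaning the smallest and largest values that are not outliers. It assumes the quartile rule used earlier in these slides; the function name is illustrative.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max) using medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = median(ordered[: n // 2])
    q3 = median(ordered[(n + 1) // 2 :])
    return ordered[0], q1, median(ordered), q3, ordered[-1]

travel_times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
                15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
minimum, q1, m, q3, maximum = five_number_summary(travel_times)

# Whiskers stop at the most extreme values that are not outliers
step = 1.5 * (q3 - q1)
inside = [t for t in travel_times if q1 - step <= t <= q3 + step]
whiskers = (min(inside), max(inside))

print((minimum, q1, m, q3, maximum))  # (5, 15.0, 22.5, 42.5, 85)
print(whiskers)                       # (5, 65)
```

The upper whisker ends at 65, and 85 is plotted separately as an outlier.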
Example
When we use the mean instead of the
median as a measure of center, we need
another way to measure spread.
Suppose that we have these data values:
24 34 26 30 28 21 35 29 37 16
First find the mean:
Then find the deviations: $(x - \bar{x})$
What is the sum of the deviations from the mean? $\sum (x - \bar{x})$
Square the deviations: $(x - \bar{x})^2$
Find the average of the squared deviations: $\frac{\sum (x - \bar{x})^2}{n}$
The average of the deviations squared is called the variance.
Population parameter: $\sigma^2$
Sample statistic: $s^2$
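The steps above can be followed numerically for the ten data values. The sketch shows that the deviations from the mean sum to zero, which is why they are squared before averaging.

```python
data = [24, 34, 26, 30, 28, 21, 35, 29, 37, 16]

x_bar = sum(data) / len(data)              # mean: 28.0
deviations = [x - x_bar for x in data]     # x - x_bar for each value
squared = [d ** 2 for d in deviations]     # squared deviations

print(x_bar)                      # 28.0
print(sum(deviations))            # 0.0
print(sum(squared) / len(data))   # 38.4 (average of the squared deviations)
```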
Calculation of variance of a sample
$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$, where $n - 1$ is the degrees of freedom (df)
Degrees of Freedom (df)
• n deviations contain (n - 1)
independent pieces of
information about
variability
• Measuring Spread: The Standard Deviation
Definition:
The standard deviation $s_x$ measures the average distance of the observations from their mean. It is calculated by finding an average of the squared distances and then taking the square root.
variance = $s_x^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \dots + (x_n - \bar{x})^2}{n - 1} = \frac{1}{n - 1} \sum (x_i - \bar{x})^2$
standard deviation = $s_x = \sqrt{\frac{1}{n - 1} \sum (x_i - \bar{x})^2}$
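The sample variance and standard deviation can be computed by hand and compared with the standard library. The sketch uses the ten data values from the deviations example and divides by n - 1, as in the definition above.

```python
from math import sqrt
from statistics import stdev, variance

data = [24, 34, 26, 30, 28, 21, 35, 29, 37, 16]

n = len(data)
x_bar = sum(data) / n
s2 = sum((x - x_bar) ** 2 for x in data) / (n - 1)   # sample variance
s = sqrt(s2)                                         # sample standard deviation

print(round(s2, 3), round(s, 3))   # 42.667 6.532
# statistics.variance and statistics.stdev also divide by n - 1
print(round(variance(data), 3), round(stdev(data), 3))
```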
Using a Calculator:
• ENTER DATA IN L1
• Run 1-Var Stats on L1 or use the LIST menu option
Which measure(s) of
variability is/are
resistant?
• Choosing Measures of Center and Spread
• We now have a choice between two descriptions for center and spread
– Mean and Standard Deviation
– Median and Interquartile Range
•The median and IQR are usually better than the mean and standard deviation for describing a skewed distribution or a distribution with outliers.
•Use mean and standard deviation only for reasonably symmetric distributions that don’t have outliers.
•NOTE: Numerical summaries do not fully describe the shape of a distribution. ALWAYS PLOT YOUR DATA!
COEFFICIENT OF VARIATION:
a measurement of the relative variability (or consistency) of data
$CV = \frac{s}{\bar{x}} \cdot 100$ or $CV = \frac{\sigma}{\mu} \cdot 100$
CV is used to compare variability or consistency
A sample of newborn infants had a mean weight of
6.2 pounds with a standard deviation of 1 pound.
A sample of three-month-old children had a mean
weight of 10.5 pounds with a standard deviation of
1.5 pounds.
Which (newborns or 3-month-olds) are more
variable in weight?
To compare variability, compare Coefficient of Variation
For newborns: CV = 16% (Higher CV: more variable)
For 3-month-olds: CV = 14% (Lower CV: more consistent)
Use Coefficient of Variation
To compare two groups of data,
to answer:
Which is more consistent?
Which is more variable?
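The infant-weight comparison can be reproduced with a one-line function. The name `coefficient_of_variation` is illustrative; the inputs are the means and standard deviations given in the example.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_newborns = coefficient_of_variation(1, 6.2)          # about 16.1
cv_three_months = coefficient_of_variation(1.5, 10.5)   # about 14.3

print(round(cv_newborns), round(cv_three_months))  # 16 14
```

The three-month-olds have the larger standard deviation (1.5 pounds versus 1 pound), but relative to their mean weight the newborns are more variable.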
Linear Transformations
Variables can be measured in different units
(feet vs meters, pounds vs kilograms, etc)
When converting units, the measures of center
and spread will change.
Linear transformation rule
• When a random variable is multiplied by a constant and a constant is added to it, the mean changes by both.
• When a random variable is multiplied by a constant and a constant is added to it, the standard deviation changes only by the multiplication.
• Formulas:
$\mu_{ax+b} = a\mu_x + b$
$\sigma_{ax+b} = |a|\sigma_x$
An appliance repair shop charges a $30 service call
to go to a home for a repair. It also charges $25 per
hour for labor. From past history, the average length
of repairs is 1 hour 15 minutes (1.25 hours) with
standard deviation of 20 minutes (1/3 hour).
Including the charge for the service call, what is the
mean and standard deviation for the charges for a
service call?
$\mu = 30 + 25(1.25) = \$61.25$
$\sigma = 25\left(\frac{1}{3}\right) = \$8.33$
Chapter 1 Summary
Data Analysis is the art of describing data in
context using graphs and numerical
summaries. The purpose is to describe the
most important features of a dataset.