Chapter 3
NUMERICAL METHODS
FOR DESCRIBING DATA
DISTRIBUTIONS
Created by Kathy Fritz
Suppose that you have just received your
score on an exam in one of your classes.
What would you want to know about the
distribution of scores for this exam?
Measures of center
Measures of spread
The stress of the final years of medical training can contribute to
depression and burnout. The authors of the paper “Rates of
Medication Errors Among Depressed and Burnt Out Residents”
(British Medical Journal [2008]: 488) studied 24 residents in
pediatrics. Medical records of patients treated by these residents
during a fixed time period were examined for errors in ordering or
administering medications. The accompanying dotplot displays
the total number of medication errors for each of the 24 residents.
Choosing Appropriate
Measures for Describing
Center and Spread
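If the shape of the data distribution is …    Describe center and spread using …
Approximately symmetric                       Mean and standard deviation
Skewed or has outliers                        Median and interquartile range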
Describing Center and Spread
For Data Distributions That Are
Approximately Symmetric
Mean
Standard Deviation
Mean
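Definition:
The mean of n observations x_1, x_2, ..., x_n is the sum of the
observations divided by n.
In mathematics, the capital Greek letter Σ is short for “add them
all up.” Therefore, the formula for the mean can be written in
more compact notation: $\bar{x} = \frac{\sum x_i}{n}$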
Measuring Center
Use the data below to calculate the mean of the
commuting times (in minutes) of 20 randomly selected
New York workers.
10 30  5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
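The calculation asked for above can be sketched in a few lines of Python. This is only an illustration: the names `times` and `mean` are made up for the example and do not come from the slides.

```python
# Commuting times (in minutes) of the 20 New York workers listed above.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]


def mean(values):
    """Add up all the observations and divide by how many there are."""
    return sum(values) / len(values)


x_bar = mean(times)
print(sum(times), len(times))  # 625 20
print(x_bar)                   # 31.25
```

The mean of 31.25 minutes sits well above most of the individual times, a first hint that the few long commutes pull it upward.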
Measuring Variability
Consider the three sets of six exam scores displayed
below:
Each data set has a mean exam score of 75.
Does that completely describe these data sets?
Range
Deviations
The most widely used measures of variability
50 60 70 80 90 100
Variance and Standard
Deviation
Suppose that we are interested in finding the “typical”
or average deviation from the mean.
The deviations from the mean always add up to zero, so
averaging them directly is not useful.
So, to calculate the “typical” or average deviation
from the mean, we must first square each deviation.
Then all the squared deviations are positive.
The deviations from the mean were -25, -15, -5, 5, 15,
and 25. The squares of these deviations from the
mean are 625, 225, 25, 25, 225, and 625.
Now we can average these.
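A minimal sketch of this calculation, assuming the data set behind the stated deviations is 50, 60, 70, 80, 90, 100 and that the "average" of the squared deviations uses the usual sample divisor n - 1 (the slide's own formula did not survive in this transcript, so the divisor is an assumption):

```python
import math

# Assumed data set: it has mean 75 and reproduces the deviations quoted above.
scores = [50, 60, 70, 80, 90, 100]
n = len(scores)

x_bar = sum(scores) / n                       # mean
deviations = [x - x_bar for x in scores]      # distance of each score from the mean
squared = [d ** 2 for d in deviations]        # squaring removes the signs

# Assumption: sample variance, so divide by n - 1 rather than n.
variance = sum(squared) / (n - 1)
std_dev = math.sqrt(variance)                 # back to the original units

print(deviations)          # [-25.0, -15.0, -5.0, 5.0, 15.0, 25.0]
print(sum(deviations))     # 0.0
print(variance)            # 350.0
print(round(std_dev, 2))   # 18.71
```

Taking the square root at the end returns the measure to the original units (exam points), which is why the standard deviation is usually reported instead of the variance.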
Consider the following data on the number of pets owned by a
group of 9 children.
Measuring Spread: The Standard Deviation
xi:             1  3  4  4  4  5  7  8  9
(xi - mean):
(xi - mean)^2:
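The empty rows of the table can be filled in programmatically. The sketch below is illustrative only; it assumes the sample standard deviation (divisor n - 1), and the variable names are invented for the example.

```python
import math

# Number of pets owned by each of the 9 children.
pets = [1, 3, 4, 4, 4, 5, 7, 8, 9]
n = len(pets)
x_bar = sum(pets) / n

# Build the three columns of the table row by row.
rows = [(x, x - x_bar, (x - x_bar) ** 2) for x in pets]
print("xi  xi-mean  (xi-mean)^2")
for x, dev, sq in rows:
    print(f"{x:2d}  {dev:7.1f}  {sq:11.1f}")

sum_sq = sum(sq for _, _, sq in rows)
variance = sum_sq / (n - 1)        # assumption: sample variance
std_dev = math.sqrt(variance)

print(x_bar)               # 5.0
print(sum_sq)              # 52.0
print(variance)            # 6.5
print(round(std_dev, 2))   # 2.55
```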
Notation to remember
Putting it Together

Describing Center and Spread
For Data Distributions That Are
Skewed or Have Outliers
Median
Interquartile Range
Median
The median M
The sample median is obtained by first ordering
the n observations from smallest to largest (with
any repeated values included, so that every
sample observation appears in the ordered
list).
Then . . .
Forty students were enrolled in a statistical reasoning
course at a California college. The instructor made course
materials, grades, and lecture notes available to students
on a class web site. Course management software kept
track of how often each student accessed any of these
web pages. The data set below (in order from smallest to
largest) is the number of times each of the 40 students had
accessed the class web page during the first month.
 0  0  0  0  0  0  3  4  4   4
 5  5  7  7  8  8  8 12 12  13
13 13 14 14 16 18 19 19 20  20
21 22 23 26 36 36 37 42 84 331
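The rule that follows "Then . . ." is not preserved in this transcript. The sketch below assumes the standard one: take the middle value when n is odd and the average of the two middle values when n is even. The function name `median` is chosen for the example.

```python
# Number of visits to the class web page for each of the 40 students.
visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4,
          5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
          13, 13, 14, 14, 16, 18, 19, 19, 20, 20,
          21, 22, 23, 26, 36, 36, 37, 42, 84, 331]


def median(values):
    """Middle value of the ordered list (assumed standard rule)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd n: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even n: average the middle two


visits_median = median(visits)
visits_mean = sum(visits) / len(visits)
print(visits_median)  # 13.0
print(visits_mean)    # 23.1
```

The two unusually large counts (84 and 331) pull the mean far above the median, which is why the median is the better description of a typical student here.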
Comparing the Mean and the Median

The mean and median measure center in different ways,
and both are useful.

Don’t confuse the “average” value of a variable (the mean)
with its “typical” value, which we might describe by the
median.
Measuring Spread - Interquartile Range
The interquartile range (iqr) is based on quantities
called quartiles, which divide the data set into
four equal parts (quarters).
Lower quartile (Q1) = median of the lower half of the data
Upper quartile (Q3) = median of the upper half of the data
If n is odd, the median of the entire data set is
excluded from both halves when computing quartiles.
Measuring Spread: The Interquartile Range

A measure of center alone can be misleading.

A useful numerical description of a distribution requires
both a measure of center and a measure of spread.
How to Calculate the Quartiles and the Interquartile
Range
To calculate the quartiles:
Recall the website data set:
 0  0  0  0  0  0  3  4  4   4
 5  5  7  7  8  8  8 12 12  13
13 13 14 14 16 18 19 19 20  20
21 22 23 26 36 36 37 42 84 331
The lower quartile (Q1) is the median of the lower 20
data values.
The upper quartile (Q3) is the median of the upper 20
data values.
The interquartile range (iqr) is the difference between the
upper and lower quartiles: iqr = Q3 - Q1.
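A possible sketch of the quartile calculation for the website data, following the rule stated in the slides (each quartile is the median of one half of the ordered data, with the overall median left out of both halves when n is odd). The helper names are hypothetical.

```python
visits = [0, 0, 0, 0, 0, 0, 3, 4, 4, 4,
          5, 5, 7, 7, 8, 8, 8, 12, 12, 13,
          13, 13, 14, 14, 16, 18, 19, 19, 20, 20,
          21, 22, 23, 26, 36, 36, 37, 42, 84, 331]


def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # When n is odd, skip the overall median so it is in neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(upper)


q1, q3 = quartiles(visits)
iqr = q3 - q1
print(q1, q3, iqr)  # 4.5 20.5 16.0
```

Statistical software often uses slightly different quartile conventions, so values from a calculator or package may differ a little from these.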
Putting it Together
The Chronicle of Higher Education (Almanac Issue,
2009-2010) published the accompanying data on the
percentage of the population with a bachelor’s
degree or graduate degree in 2007 for each of the 50
U.S. states and the District of Columbia. The data
distribution is shown in the histogram below.
Step 1: Select
Putting it Together
Step 2: Calculations
Step 3: Interpret

Find and Interpret the IQR
Travel times to work for 20 randomly selected New Yorkers
10 30  5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
Boxplots
General Boxplots
Modified Boxplots
Five-Number Summary
The five-number summary consists of the
following:
Smallest observation
Lower quartile
Median
Upper quartile
Largest observation
Boxplots
When to Use
Univariate numerical data
How to construct
What to look for
center, spread, and shape of the data distribution
and if there are any unusual features
Boxplot Example
Comparative Boxplots
A comparative boxplot is
Recall the video game study. There were two groups:
1) told to improve total score or 2) told to improve a
different aspect, such as speed.
[Figure: comparative boxplot for the 1st and 2nd groups]
Identifying Outliers
In addition to serving as a measure of spread, the
interquartile range (IQR) is used as part of a rule of
thumb for identifying outliers.
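Definition:
The 1.5 x IQR Rule for Outliers
Call an observation an outlier if it falls more than 1.5 x IQR
above the upper quartile or below the lower quartile.
In the New York travel time data, we found Q1 = 15 minutes,
Q3 = 42.5 minutes, and IQR = 27.5 minutes.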
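Applying the rule to the travel times might look like the sketch below. The quartiles are the values quoted in the text; the names `lower_fence` and `upper_fence` are just labels chosen for this example.

```python
# New York travel times (minutes) and the quartiles quoted in the text.
times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]
q1, q3 = 15.0, 42.5
iqr = q3 - q1

# Anything beyond 1.5 x IQR from the quartiles is flagged as an outlier.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [t for t in times if t < lower_fence or t > upper_fence]

print(iqr)                       # 27.5
print(lower_fence, upper_fence)  # -26.25 83.75
print(outliers)                  # [85]
```

Only the 85-minute commute lies beyond the upper cutoff; no travel time can fall below the negative lower cutoff.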
Modified boxplots
How to construct
1. Compute the values in the five-number summary.
2. Draw a horizontal line and add an appropriate scale.
3. Draw a box above the line that extends from the lower
quartile (Q1) to the upper quartile (Q3).
4. Draw a line segment inside the box at the location of
the median.

Construct a Boxplot
Consider our NY travel times data. Construct a boxplot.
10 30  5 25 40 20 10 15 30 20
15 20 85 15 65 15 60 60 40 45
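The numbers needed to draw the boxplot can be computed as in this sketch. It assumes the usual convention for a modified boxplot, in which each whisker stops at the most extreme observation that is not an outlier; the remaining construction steps are not preserved in this transcript, so treat that as an assumption.

```python
from statistics import median

times = [10, 30, 5, 25, 40, 20, 10, 15, 30, 20,
         15, 20, 85, 15, 65, 15, 60, 60, 40, 45]

ordered = sorted(times)
n = len(ordered)
half = n // 2
lower = ordered[:half]
upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]

# Five-number summary: smallest, Q1, median, Q3, largest.
q1, med, q3 = median(lower), median(ordered), median(upper)
five_number = (ordered[0], q1, med, q3, ordered[-1])

# Assumed whisker rule: extend to the most extreme non-outliers.
iqr = q3 - q1
inside = [t for t in ordered if q1 - 1.5 * iqr <= t <= q3 + 1.5 * iqr]
whiskers = (inside[0], inside[-1])

print(five_number)  # (5, 15.0, 22.5, 42.5, 85)
print(whiskers)     # (5, 65)
```

In the finished plot the box runs from 15 to 42.5 with the median line at 22.5, the whiskers reach 5 and 65, and 85 is plotted as a separate point.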
Big Mac prices in U.S. dollars for 44 different countries
were given in the article “Big Mac Index 2010”. The
following 44 Big Mac prices are arranged in order
from the lowest price (Ukraine) to the highest price
(Norway).
1.84 1.86 1.90 1.95 2.17 2.19 2.19 2.28 2.33
2.34 2.45 2.46 2.50 2.51 2.60 2.62 2.67 2.71
2.80 2.82 2.99 3.08 3.33 3.34 3.43 3.48 3.54
3.56 3.59 3.67 3.73 3.74 3.83 3.84 3.84 3.86
3.89 4.00 4.33 4.39 4.90 4.91 6.19 6.56 7.20
Big Mac Prices Continued . . .
Smallest observation =
Lower quartile =
Median =
Upper quartile =
Largest observation =
[Figure: boxplot of Big Mac prices; horizontal axis "Big Mac Prices", scaled from 1 to 8]
The 2009-2010 salaries of NBA players published
on the web site hoopshype.com were used to
construct the comparative boxplot of salary
data for five teams.
Measures of Relative
Standing
z-scores
Percentiles
Percentiles
For a number r between 0 and 100, the rth
percentile is a value such that r percent of the
observations fall AT or BELOW that value.
This diagram illustrates the 90th percentile.
Measuring Position: Percentiles
One way to describe the location of a value in a
distribution is to tell what percent of observations are
less than it.
Definition:
Jenny earned a score of 86 on her test. How did she perform
relative to the rest of the class?
6 | 7
7 | 2334
7 | 5777899
8 | 00123334
8 | 569
9 | 03
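Jenny's percentile can be worked out from the stemplot with a short sketch. It assumes the stems are tens and the leaves are ones (so the first row is a score of 67), and it uses the "percent of observations less than it" description given above. The function name is hypothetical.

```python
# Scores read off the stemplot (assumption: stem = tens, leaf = ones).
scores = [67,
          72, 73, 73, 74,
          75, 77, 77, 77, 78, 79, 79,
          80, 80, 81, 82, 83, 83, 83, 84,
          85, 86, 89,
          90, 93]


def percent_below(value, data):
    """Percent of observations strictly less than the given value."""
    below = sum(1 for x in data if x < value)
    return 100 * below / len(data)


jenny = percent_below(86, scores)
print(len(scores))  # 25
print(jenny)        # 84.0
```

Under this reading, 21 of the 25 scores are below 86, which places Jenny at the 84th percentile of the class.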
In addition to weight and length, head
circumference is another measure of health in
newborn babies. The National Center for
Health Statistics reports the following summary
values for head circumference (in cm) at birth
for boys.
Percentile                 5     10    25    50    75    90    95
Head circumference (cm)    32.2  33.2  34.5  35.8  37.0  38.2  38.6
What value of head circumference is
at the 75th percentile?
What is the median value of head
circumference?
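z-scores
Definition:
The z-score (standardized score) of an observation x is
$z = \frac{x - \text{mean}}{\text{standard deviation}}$
The z-score tells you how many standard deviations from the
mean an observation falls, and in what direction.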
Measuring Position: z-Scores
Jenny earned a score of 86 on her test. The
class mean is 80 and the standard deviation
is 6.07. What is her standardized score?
Using z-scores for Comparison
We can use z-scores to compare the position of individuals in
different distributions.
Jenny earned a score of 86 on her statistics test. The class
mean was 80 and the standard deviation was 6.07. She earned
a score of 82 on her chemistry test. The chemistry scores had
a fairly symmetric distribution with a mean 76 and standard
deviation of 4. On which test did Jenny perform better relative
to the rest of her class?
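The comparison can be checked with a small sketch that standardizes each score against its own class. The function name `z_score` is invented for this example.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations the value lies from the mean."""
    return (value - mean) / std_dev


z_stats = z_score(86, mean=80, std_dev=6.07)  # statistics test
z_chem = z_score(82, mean=76, std_dev=4)      # chemistry test

print(round(z_stats, 2))  # 0.99
print(z_chem)             # 1.5
print("chemistry" if z_chem > z_stats else "statistics")  # chemistry
```

Although the raw statistics score is higher, the chemistry score is 1.5 standard deviations above its class mean compared with about 0.99 for statistics, so Jenny did better on chemistry relative to her classmates.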
What do these z-scores mean?
-2.3
1.8
Suppose that two graduating seniors, one a
marketing major and one an accounting major, are
comparing job offers. The accounting major has an
offer for $45,000 per year, and the marketing major
has an offer for $43,000 per year.
Accounting: mean = 46,000 standard deviation = 1500
Marketing: mean = 42,500 standard deviation = 1000
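One way to compare the two offers, in the spirit of the preceding example, is to standardize each against its own field. The worked values below assume the listed means and standard deviations describe salary offers for each major.

```latex
\[
z_{\text{accounting}} = \frac{45{,}000 - 46{,}000}{1500} \approx -0.67,
\qquad
z_{\text{marketing}} = \frac{43{,}000 - 42{,}500}{1000} = 0.50
\]
```

Under that assumption the accounting offer is larger in dollars but below average for its field, while the marketing offer is half a standard deviation above average for its field.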
Density Curve
Definition:
A density curve is a curve that
is always on or above the horizontal axis, and
has area exactly 1 underneath it.
A density curve describes the overall pattern of a
distribution. The area under the curve and above any
interval of values on the horizontal axis is the proportion of
all observations that fall in that interval.
The overall pattern of this histogram of
the scores of all 947 seventh-grade
students in Gary, Indiana, on the
vocabulary part of the Iowa Test of Basic
Skills (ITBS) can be described by a
smooth curve drawn through the tops of
the bars.
Normal Distributions
One particularly important class of density curves is the class of
Normal curves, which describe Normal distributions.
All Normal curves are symmetric, single-peaked, and bell-shaped.
A specific Normal curve is described by giving its mean µ and
standard deviation σ.
Two Normal curves, showing the mean µ and standard deviation σ.
Normal Distributions
Definition:
A Normal distribution is described by a Normal density curve.
Any particular Normal distribution is completely specified by
two numbers: its mean µ and standard deviation σ.
Normal distributions are good descriptions for some distributions of real
data.
Normal distributions are good approximations of the results of many
kinds of chance outcomes.
Many statistical inference procedures are based on Normal
distributions.
Empirical Rule
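If the data distribution is mound shaped and
approximately symmetric, then . . .
Approximately 68% of the observations are within 1 standard
deviation of the mean.
Approximately 95% of the observations are within 2 standard
deviations of the mean.
Approximately 99.7% of the observations are within 3 standard
deviations of the mean.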
Empirical Rule
This illustrates the percentages given by the
Empirical Rule.
The distribution of Iowa Test of Basic
Skills (ITBS) vocabulary scores for 7th
grade students in Gary, Indiana, is
close to Normal. Suppose the
distribution is N(6.84, 1.55).
a) Sketch the Normal density curve for
this distribution.
b) What percent of ITBS vocabulary
scores are less than 3.74?
c) What percent of the scores are
between 5.29 and 9.94?
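Parts (b) and (c) can be approached with the Empirical Rule, as in the sketch below. It reads N(6.84, 1.55) as mean 6.84 and standard deviation 1.55; the comparison with `statistics.NormalDist` from the Python standard library is an addition for illustration and is not part of the slides.

```python
from statistics import NormalDist

mu, sigma = 6.84, 1.55


def z(x):
    """Standardize x against the assumed N(6.84, 1.55) model."""
    return (x - mu) / sigma


# 3.74, 5.29 and 9.94 sit 2 below, 1 below and 2 above the mean (in SDs).
print(round(z(3.74), 2), round(z(5.29), 2), round(z(9.94), 2))  # -2.0 -1.0 2.0

# Empirical Rule: 68% within 1 SD and 95% within 2 SDs of the mean.
pct_below = (100 - 95) / 2      # (b) half of what lies outside 2 SDs
pct_between = 68 / 2 + 95 / 2   # (c) from 1 SD below to 2 SDs above
print(pct_below, pct_between)   # 2.5 81.5

# Cross-check against the Normal model itself.
dist = NormalDist(mu, sigma)
exact_below = 100 * dist.cdf(3.74)
exact_between = 100 * (dist.cdf(9.94) - dist.cdf(5.29))
print(round(exact_below, 1), round(exact_between, 1))  # 2.3 81.9
```

The small gaps between the two sets of answers are expected, because the 68% and 95% figures in the Empirical Rule are themselves rounded.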
Common Mistakes
Avoid these Common
Mistakes
1. Watch out for categorical data that look
numerical! Often, categorical data are
coded numerically. For example, gender
might be coded as 0 = female and 1 =
male, but this does not make gender a
numerical variable. Categorical data
CANNOT be summarized using the mean
and standard deviation or the median and
interquartile range.
2. Measures of center don’t tell all. Although
measures of center, such as the mean and
the median, do give you a sense of what
might be a typical value for a variable, this is
only one characteristic of a data set.
Without additional information about
variability and distribution shape, you don’t
really know much about the behavior of the
variable.
3. Data distributions with different shapes can
have the same mean and standard
deviation. For example, consider the
following two histograms:
Both histograms have the same mean of 10 and
standard deviation of 2, but VERY different shapes.
4. Both the mean and the standard deviation
are sensitive to extreme values in a data set,
especially if the sample size is small.
If the data distribution is markedly skewed or if
the data set has outliers, the median and
interquartile range are a better choice for
describing center and spread.
5. Measures of center and measures of
variability describe values of a variable, not
frequencies in a frequency distribution or
heights of bars in a histogram. For example,
consider the following two frequency
distributions and histograms:
6. Be careful with boxplots based on small
sample sizes. Boxplots convey information
about center, variability, and shape, but
interpreting shape information is
problematic when the sample size is small.
7. Not all distributions are mound shaped.
Using the Empirical Rule in situations where
you are not convinced that the data
distribution is mound shaped and
approximately symmetric can lead to
incorrect statements.
8. Watch for outliers! Unusual observations in a
data set often provide important
information about the variable under study,
so it is important to consider outliers in
addition to describing what is typical.
Outliers can also be problematic because
the values of some summaries are
influenced by outliers and because some
methods for drawing conclusions from
data are not appropriate if the data set
has outliers.