Download Standard Deviation and

Document related concepts
no text concepts found
Transcript
Copyright © 2004 Pearson Education, Inc.
Chapter 2
Descriptive Statistics
Describe, Explore, and Compare Data
2-1
Overview
2-2
Frequency Distributions
2-3
Visualizing Data
2-4
Measures of Center
2-5
Measures of Variation
2-6
Measures of Relative Standing
2-7
Exploratory Data Analysis
Copyright © 2004 Pearson Education, Inc.
Section 2-1
Overview
Created by Tom Wegleitner, Centreville, Virginia
Copyright © 2004 Pearson Education, Inc.
Overview
 Descriptive Statistics
Describe the important characteristics
of a set of data.
Organize, present and summarize data:
1. Graphically
2. Numerically
Copyright © 2004 Pearson Education, Inc.
Important Characteristics of
Quantitative Data
“Shape, Center, and Spread”
1. Center: A representative or average value that
indicates where the middle of the data set is located
2. Variation: A measure of the amount that the values
vary among themselves
3. Distribution: The nature or shape of the distribution
of data (such as bell-shaped, uniform, or skewed)
Copyright © 2004 Pearson Education, Inc.
Section 2-2 and 2-3
Frequency Distributions
And Visualizing Data
Created by Tom Wegleitner, Centreville, Virginia
Copyright © 2004 Pearson Education, Inc.
Frequency Distributions
And Histograms

Frequency Distribution
Table that organizes data values into classes
along with the number of data values that fall in
each class (frequency, f).
1. Ungrouped Frequency Distribution – for data
sets with few different values. Each value is in its
own class.
2. Grouped Frequency Distribution: for data sets
with many different values, which are grouped
together in the classes.
Copyright © 2004 Pearson Education, Inc.
Ungrouped Frequency
Distributions
Number of Peas in a Pea Pod
Sample Size: 50
5
5
4
6
4
3
7
6
3
5
6
5
4
5
5
6
2
3
5
5
5
5
7
4
3
4
5
4
5
6
5
1
6
2
6
6
6
6
6
4
4
5
4
5
3
5
5
7
6
5
Peas per pod
Freq, f
Copyright © 2004 Pearson Education, Inc.
Peas per pod
Freq, f
1
1
2
2
3
5
4
9
5
18
6
12
7
3
Frequency Histogram
A bar graph that represents the frequency
distribution of a data set. It has the
following properties:
1. Horizontal scale is quantitative and
measures the data values.
2. Vertical scale measures the
frequencies of the classes.
3. Consecutive bars must touch.
Copyright © 2004 Pearson Education, Inc.
Frequency Histogram
Ex. Peas per Pod
Freq, f
1
1
2
2
3
5
4
9
Number of Peas in a Pod
20
15
Freq, f
Peas per pod
10
5
5
18
0
6
12
1
2
3
4
5
Number of Peas
7
3
Copyright © 2004 Pearson Education, Inc.
6
7
Relative Frequency Distributions
and Relative Frequency Histograms

Relative Frequency Distribution
Shows the proportion (or percentage) of data
values that fall into each class
relative frequency: rf = f/n

Relative Frequency Histogram
Has the same shape and horizontal scale as a
histogram, but the vertical scale is marked with
relative frequencies.
Copyright © 2004 Pearson Education, Inc.
Relative Frequency Histogram
Has the same shape and horizontal scale as a
histogram, but the vertical scale is marked with
relative frequencies.
Figure 2-2
Copyright © 2004 Pearson Education, Inc.
Grouped Frequency
Distributions
Group data into 5-20 classes of equal
width.
Exam Scores
Freq, f
30-39
1
40-49
0
50-59
4
60-69
9
70-79
13
80-89
10
90-99
3
Copyright © 2004 Pearson Education, Inc.
Definitions

Lower class limits: are the smallest numbers
that can actually belong to different classes

Upper class limits: are the largest numbers
that can actually belong to different classes

Class width: is the difference between two
consecutive lower class limits or two consecutive lower
class boundaries

Class midpoints: the value halfway between
LCL and UCL

Class boundaries: the value halfway between
an UCL and the next LCL
Copyright © 2004 Pearson Education, Inc.
Constructing a Grouped Frequency Table
1.
Calculate the range of values to span the set: Range = Hi – Low.
(May round up)
2.
Decide on the number of classes (should be between 5 and 20) .
3.
Calculate class width:
(May round up)
4.
Choose the 1st LCL (less than or equal to smallest value)
5.
Write all LCLs by adding the class width.
6.
Enter all the UCLs.
7.
Find the frequencies for each class.
class width 
(highest value) – (lowest value)
number of classes
Copyright © 2004 Pearson Education, Inc.
“Shape” of Distribution
 Symmetric
Data is symmetric if the left half of its
histogram is roughly a mirror image of
its right half.
 Skewed
Data is skewed if it is not symmetric
and if it extends more to one side than
the other.
 Uniform
Data is uniform if it is equally distributed (on
a histogram, all the bars are the same
height).
Copyright © 2004 Pearson Education, Inc.
Shape
Figure 2-11
Copyright © 2004 Pearson Education, Inc.
Outliers
are “unusal” data values as
compared to the rest of the set. They
may be distinguished by gaps in a
histogram.
 Outliers
Copyright © 2004 Pearson Education, Inc.
Other Graphs
 Besides histograms, there are other
ways to graph quantitative data:
1. Stem and Leaf plots
2. Dot plots
3. Time Series
Copyright © 2004 Pearson Education, Inc.
Stem-and Leaf Plot
Represents data by separating each value into two
parts: the stem (such as the leftmost digit) and the
leaf (such as the rightmost digit)
Copyright © 2004 Pearson Education, Inc.
Dot Plot
Consists of a graph in which each data value is
plotted as a point along a scale of values
Figure 2-5
Copyright © 2004 Pearson Education, Inc.
Time-Series Graph
Data that have been collected at different points in
time.
Figure 2-8
Ex. www.eia.doe.gov/oil_gas/petroleum/
Copyright © 2004 Pearson Education, Inc.
Qualitative Data
 The two most common graphs for
qualitative data are:
1. Pareto Charts (Bar charts)
2. Pie Charts
Copyright © 2004 Pearson Education, Inc.
Pareto Chart
A bar graph for qualitative data, with the bars
arranged in order according to frequencies
Figure 2-6
Copyright © 2004 Pearson Education, Inc.
Pie Chart
A graph depicting qualitative data as slices pf a pie
Figure 2-7
Copyright © 2004 Pearson Education, Inc.
Section 2-4
Measures of Center
Created by Tom Wegleitner, Centreville, Virginia
Copyright © 2004 Pearson Education, Inc.
Measures of Center
 Measure of Center
Number representing a “typical” or
central value of a data set. An “average”.
There are 4 common “averages”:
1. Mean
2. Median
3. Mode
4. Midrange
Copyright © 2004 Pearson Education, Inc.
The Mean
Mean: the measure of center
obtained by adding the values and
dividing the total by the number of
values.
Copyright © 2004 Pearson Education, Inc.
Notation

denotes the addition of a set of values
x
is the variable usually used to represent the individual
data values
n
represents the number of values in a sample
N
represents the number of values in a population
Copyright © 2004 Pearson Education, Inc.
Notation
x is pronounced ‘x-bar’ and denotes the mean of a set
of sample values
x
x =
n
µ is pronounced ‘mu’ and denotes the mean of all values
in a population
µ =
x
N
Copyright © 2004 Pearson Education, Inc.
Round-off Rule for
Measures of Center
Carry one more decimal place than is
present in the original set of values.
Copyright © 2004 Pearson Education, Inc.
Median
 Median
the middle value when the original
data values are arranged in order of
increasing (or decreasing) magnitude
 often denoted by x~
(pronounced ‘x-tilde’)
 is not affected by an extreme value
Copyright © 2004 Pearson Education, Inc.
Finding the Median
 If the number of values is odd, the
median is the number located in the
exact middle of the list
If the number of values is even, the
median is found by computing the
mean of the two middle numbers
Copyright © 2004 Pearson Education, Inc.
2
5
6
11
13
odd number of values:
median is the exact middle value
MEDIAN is 6
2
5
6
9
11
13
even number of values:
median is the mean of the by two numbers
6+9
2
MEDIAN is 7.5
Copyright © 2004 Pearson Education, Inc.
Mode
 Mode: the value that occurs most frequently.
The mode is not always unique. A data set may
be:
Bimodal
Multimodal
No Mode
example:
a. 5.40 1.10 0.42 0.73 0.48 1.10
Mode is 1.10
b. 27 27 27 55 55 55 88 88 99
Bimodal -
c. 1 2 3 6 7 8 9 10
No Mode
Copyright © 2004 Pearson Education, Inc.
27 & 55
Midrange
Midrange: the value midway between
the highest and lowest values in the
Original data set.
Midrange =
highest score + lowest score
2
Copyright © 2004 Pearson Education, Inc.
Best Measure of Center
Copyright © 2004 Pearson Education, Inc.
Picking the best “average”
 The shape of your data may help
determine the best measure of center.
 Outliers may effect the mean, making it
too high or too low to represent a
“typical” value. If so, the median may
be the best choice.
Copyright © 2004 Pearson Education, Inc.
Shape
Figure 2-11
Copyright © 2004 Pearson Education, Inc.
Section 2-5
Measures of Variation
Created by Tom Wegleitner, Centreville, Virginia
Copyright © 2004 Pearson Education, Inc.
Measures of Variation
“Spread”
Because this section introduces the concept
of variation, this is one of the most important
sections in the entire book.
The two most common methods of
measuring spread:
1. Range
2. Standard deviation and variance
Copyright © 2004 Pearson Education, Inc.
Definition
The range of a set of data is the
difference between the highest
value and the lowest value
highest
value
lowest
value
Copyright © 2004 Pearson Education, Inc.
Standard Deviation and Variance
measure the amount data values vary
(or deviate) from the mean.
sample variance:
2
S
 (x - x)
=
n-1
sample standard deviation:
S=
s
2
=
 (x - x)
n-1
Copyright © 2004 Pearson Education, Inc.
2
2
Round-off Rule
for Measures of Variation
Carry one more decimal place than
is present in the original set of
data.
Round only the final answer, not values in
the middle of a calculation.
Copyright © 2004 Pearson Education, Inc.
Notation
Sample
Population
Statistics Parameters
Mean
x
µ
Standard
Deviation
s
σ
Variance
s2
σ2
Copyright © 2004 Pearson Education, Inc.
Sample vs. Population Standard
Deviation
Note: Unlike x and µ, the formulas for s and σ
are not mathematically the same:
s=
 =
 (x - x)
n-1
2
 (x - µ)
2
N
Copyright © 2004 Pearson Education, Inc.
Standard Deviation Key Points

s0
( When would s = 0 ?)
 The standard deviation is a measure of variation of
all values from the mean.
The larger s is, the more the data varies.
 The units of the standard deviation s are the same
as the units of the original data values
(The variance has units2).
 The value of the standard deviation s can increase
dramatically with the inclusion of one or more
outliers (data values far away from all others)
Copyright © 2004 Pearson Education, Inc.
Standard Deviation and “Spread”
How does “s” show how much the
data varies?
Three methods:
1. Range Rule of Thumb
2. Chebyshev’s Theorem
3. The Empirical Rule
Copyright © 2004 Pearson Education, Inc.
The Range Rule of Thumb
Range Rule: For most data sets, the majority of
the data lies within 2 standard deviations of the
mean.
Recall: Range = High – Lo
Estimate: Range ≈ 4s
Alternatively, If the range is known, you can use the
range rule to estimate the standard deviation:
s
Range
4
Copyright © 2004 Pearson Education, Inc.
Chebyshev’s Theorem
Chebyshev’s Theorem
For data with any distribution, the proportion (or
fraction) of any set of data lying within K standard
deviations of the mean is always at least 1-1/K2, where
K is any positive number greater than 1.
 For K = 2, at least 3/4 (or 75%) of all values lie
within 2 standard deviations of the mean
 For K = 3, at least 8/9 (or 89%) of all values lie
within 3 standard deviations of the mean
Copyright © 2004 Pearson Education, Inc.
The Empirical Rule
Empirical (68-95-99.7) Rule
For data sets having a symmetric distribution:
 About 68% of all values fall within 1 standard
deviation of the mean
 About 95% of all values fall within 2 standard
deviations of the mean
 About 99.7% of all values fall within 3 standard
deviations of the mean
Copyright © 2004 Pearson Education, Inc.
The Empirical Rule
Copyright © 2004 Pearson Education, Inc.
The Empirical Rule
Copyright © 2004 Pearson Education, Inc.
The Empirical Rule
Copyright © 2004 Pearson Education, Inc.
Section 2-6 and 2-7
Measures of Position
(Relative Standing)
Created by Tom Wegleitner, Centreville, Virginia
Copyright © 2004 Pearson Education, Inc.
Measures of Position
Sometimes we want to know the “relative
standing” or “relative position” of a
particular data value in the set.
Some measures of position:
1. Standard Scores (z-scores*)
2. Median, Quartiles, Percentiles
Copyright © 2004 Pearson Education, Inc.
z-score
 The z-score (or standard score) for a data
value x is the number of standard
deviations that x is above or below the
mean.
Copyright © 2004 Pearson Education, Inc.
Computing z-scores
 To convert a data value x to a z-score:
Sample:
x
x
z= s
Population
x
µ
z=

Round to 2 decimal places
Copyright © 2004 Pearson Education, Inc.
Interpreting Z Scores
FIGURE 2-14
Whenever a value is less than the mean, its
corresponding z score is negative
Ordinary values:
z score between –2 and 2 sd
Unusual Values:
z score < -2 or z score > 2 sd
Copyright © 2004 Pearson Education, Inc.
Other Measures of
Position
 Median
 Quartiles
 Percentiles
Recall: The median separates ranked
data into 2 equal parts.
Copyright © 2004 Pearson Education, Inc.
Quartiles
Quartiles separate ranked data into 4 equal
parts:
Q1 (First Quartile) separates the bottom
25% of sorted values from the top 75%.
 Q2 (Second Quartile) same as the median;
separates the bottom 50% of sorted
values from the top 50%.
 Q1 (Third Quartile) separates the bottom
75% of sorted values from the top 25%.
Copyright © 2004 Pearson Education, Inc.
Quartiles
Q1, Q2, Q3
divides ranked scores into four equal parts
25%
Low
25%
25% 25%
Q1 Q2 Q3
(High)
(median)
Copyright © 2004 Pearson Education, Inc.
Percentiles
Just as there are quartiles separating data
into four parts, there are 99 percentiles
denoted P1, P2, . . . P99, which partition the
data into 100 groups.
Copyright © 2004 Pearson Education, Inc.
Tukey’s 5-number Summary
 Tukey’s 5-number summary:
Low
Q1
Median
Q3
High
These 5 numbers can also give
another representation of “center and
spread.”
Copyright © 2004 Pearson Education, Inc.
Boxplots
 A Boxplot (or Box & Whisker plot) is a graphical
representation of Tukey’s 5-number summary.
example:
Figure 2-16
Copyright © 2004 Pearson Education, Inc.
Boxplots
Figure 2-17
Copyright © 2004 Pearson Education, Inc.