CHAPTER 3
Data Description
© Copyright McGraw-Hill 2004
Objectives
• Summarize data using measures of central
tendency, such as the mean, median, mode,
and midrange.
• Describe data using the measures of
variation, such as the range, variance, and
standard deviation.
• Identify the position of a data value in a data
set using various measures of position, such
as percentiles, deciles, and quartiles.
Objectives (cont’d.)
• Use the techniques of exploratory data
analysis, including boxplots and five-number
summaries, to discover various aspects of data.
Introduction
• Statistical methods can be used to
summarize data.
• Measures of average are also called
measures of central tendency and include
the mean, median, mode, and midrange.
• Measures that determine the spread of data
values are called measures of variation or
measures of dispersion and include the
range, variance, and standard deviation.
Introduction (cont’d.)
• Measures of position tell where a specific
data value falls within the data set or its
relative position in comparison with other
data values.
• The most common measures of position
are percentiles, deciles, and quartiles.
Introduction (cont’d.)
• The measures of central tendency,
variation, and position are part of what is
called traditional statistics. This type of
statistics is typically used to confirm
conjectures about the data.
Introduction (cont’d.)
• Another type of statistics is called
exploratory data analysis. These
techniques include the boxplot and
the five-number summary. They can be
used to explore data to see what they
show.
Basic Vocabulary
• A statistic is a characteristic or measure
obtained by using the data values from a
sample.
• A parameter is a characteristic or measure
obtained by using all the data values for a
specific population.
• When the data in a data set are ordered, the
set is called a data array.
General Rounding Rule
• In statistics, the basic rounding rule is that
when computations are done in the calculation,
rounding should not be done until the final
answer is calculated.
The Arithmetic Average
• The mean is the sum of the values divided
by the total number of values.
• Rounding rule: the mean should be
rounded to one more decimal place than
occurs in the raw data.
• The type of mean that considers an
additional factor is called the weighted
mean.
Weighted Mean
In some cases, values vary in their degree of
importance, so they are weighted accordingly:
$$\bar{x} = \frac{\sum (w \cdot x)}{\sum w}$$
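As a minimal sketch of the weighted mean formula, the snippet below weights grade points by credit hours. The function name and the numbers are illustrative assumptions, not taken from the slides.

```python
def weighted_mean(values, weights):
    """Return sum(w * x) / sum(w) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: grade points for three courses and their credit hours.
grade_points = [4.0, 2.0, 3.0]
credit_hours = [3, 3, 4]
gpa = weighted_mean(grade_points, credit_hours)  # (12 + 6 + 12) / 10
```

With equal weights the result reduces to the ordinary arithmetic mean.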
The Arithmetic Average
• The Greek letter μ (mu) is used to
represent the population mean.
• The symbol x̄ (“x-bar”) represents the
sample mean.
• Assume that data are obtained from a
sample unless otherwise specified.
Median and Mode
• The median is the halfway point in a data
set. The symbol for the median is MD.
• The median is found by arranging the data in
order and selecting the middle point.
• The value that occurs most often in a data
set is called the mode.
• The mode for grouped data, or the class
with the highest frequency, is the modal
class.
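The median and mode can be sketched with Python's standard `statistics` module. The data values here are illustrative, not from the slides.

```python
import statistics

scores = [7, 3, 9, 3, 5]            # illustrative values
md = statistics.median(scores)      # ordered: 3 3 5 7 9, middle point is 5
mode = statistics.mode(scores)      # 3 occurs most often

even_scores = [1, 2, 3, 4]
# With an even number of values, the median is the average of the two middle values.
md_even = statistics.median(even_scores)
```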
Mean and Median
• Go to:
• http://www.ruf.rice.edu/~lane/stat_sim/descriptive/index.html
Midrange
• The midrange is defined as the sum of the
lowest and highest values in the data set
divided by 2.
• The symbol for midrange is MR.
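A short sketch of the midrange, using made-up values, also shows how strongly a single extreme value moves it:

```python
def midrange(data):
    """Return (lowest value + highest value) / 2."""
    return (min(data) + max(data)) / 2


values = [2, 3, 6, 8, 4, 1]               # illustrative values
mr = midrange(values)                     # (1 + 8) / 2

with_extreme = [2, 3, 6, 8, 4, 100]       # one extremely high value
mr_extreme = midrange(with_extreme)       # (2 + 100) / 2
```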
Central Tendency: The Mean
• One computes the mean by using all the
values of the data.
• The mean varies less than the median or
mode when samples are taken from the
same population and all three measures are
computed for these samples.
• The mean is used in computing other
statistics, such as variance.
Central Tendency: The Mean
(cont’d.)
• The mean for the data set is unique, and not
necessarily one of the data values.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean is affected by extremely high or
low values and may not be the appropriate
average to use in these situations.
Central Tendency: The Median
• The median is used when one must find the
center or middle value of a data set.
• The median is used when one must determine
whether the data values fall into the upper half
or lower half of the distribution.
• The median is used to find the average of an
open-ended distribution.
• The median is affected less than the mean by
extremely high or extremely low values.
Central Tendency: The Mode
• The mode is used when the most typical
case is desired.
• The mode is the easiest average to
compute.
• The mode can be used when the data are
nominal, such as religious preference,
gender, or political affiliation.
• The mode is not always unique. A data set
can have more than one mode, or the mode
may not exist for a data set.
Examples
a. 5.40 1.10 0.42 0.73 0.48 1.10
Mode is 1.10
b. 27 27 27 55 55 55 88 88 99
Bimodal: 27 and 55
c. 1 2 3 6 7 8 9 10
No mode
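The three cases above can be reproduced with a small helper. This is a sketch that assumes the convention used in the examples: when no value occurs more than once, the data set has no mode. The function name `modes` is hypothetical.

```python
from collections import Counter


def modes(data):
    """Return a sorted list of modes; an empty list means no mode."""
    if not data:
        return []
    counts = Counter(data)
    highest = max(counts.values())
    if highest == 1:
        # Assumed convention: every value occurs once, so there is no mode.
        return []
    return sorted(value for value, count in counts.items() if count == highest)


example_a = modes([5.40, 1.10, 0.42, 0.73, 0.48, 1.10])
example_b = modes([27, 27, 27, 55, 55, 55, 88, 88, 99])
example_c = modes([1, 2, 3, 6, 7, 8, 9, 10])
```

Note that `statistics.multimode` from the standard library would return every value for example c, so it follows a different convention.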
Central Tendency: The
Midrange
• The midrange is easy to compute.
• The midrange gives the midpoint.
• The midrange is affected by extremely high
or low values in a data set.
Best Measure of Center
Distribution Shapes
• In a positively skewed or right skewed
distribution, the majority of the data values
fall to the left of the mean and cluster at the
lower end of the distribution.
Distribution Shapes (cont’d.)
• In a symmetrical distribution, the data values
are evenly distributed on both sides of the
mean.
Distribution Shapes (cont’d.)
• When the majority of the data values fall to the
right of the mean and cluster at the upper end of
the distribution, with the tail to the left, the
distribution is said to be negatively skewed or left
skewed.
Skewness
Figure 2-11
Recap
In this section we have discussed:
• Types of measures of center: mean, median, and mode
• Mean from a frequency distribution
• Weighted means
• Best measures of center
• Skewness
Measures of
Variation
Because this section introduces the concept
of variation, this is one of the most important
sections in the entire book
Definition
The range of a set of data is
the difference between the
highest value and the lowest
value:
range = highest value - lowest value
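A one-line sketch of the range, on illustrative values, shows that it depends only on the two extreme values:

```python
def data_range(data):
    """Return highest value minus lowest value."""
    return max(data) - min(data)


sample = [10, 60, 50, 30, 40, 20]     # illustrative values
r = data_range(sample)                # 60 - 10
```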
Definition
The standard deviation of a set of
sample values is a measure of
variation of values about the mean
Population Variance
• The variance is the average of the squares
of the distance each value is from the mean.
The symbol for the population variance is σ².
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$$
Population Standard Deviation
• The standard deviation is the square root
of the variance. The symbol for the
population standard deviation is σ.
Rounding rule: The final answer should be
rounded to one more decimal place than
the original data.
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$$
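The population formulas can be sketched directly from their definitions. The data values are illustrative, and the result is cross-checked against the standard library's `statistics.pvariance` and `statistics.pstdev`.

```python
import math
import statistics


def population_variance(data):
    """Average of the squared distances from the population mean (divide by N)."""
    n = len(data)
    mu = sum(data) / n
    return sum((x - mu) ** 2 for x in data) / n


def population_sd(data):
    """Square root of the population variance."""
    return math.sqrt(population_variance(data))


population = [10, 60, 50, 30, 40, 20]        # illustrative values, mean 35
sigma_squared = population_variance(population)   # 1750 / 6
sigma = population_sd(population)
# Rounding rule: one more decimal place than the original data, applied at the end.
sigma_rounded = round(sigma, 1)
```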
Sample Standard
Deviation Formula
$$s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$$
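A minimal sketch of the sample standard deviation, written from the definition with the n - 1 divisor. The data values are illustrative.

```python
import math
import statistics


def sample_sd(data):
    """Sample standard deviation: sqrt(sum((x - mean)^2) / (n - 1))."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two values are required")
    x_bar = sum(data) / n
    return math.sqrt(sum((x - x_bar) ** 2 for x in data) / (n - 1))


sample = [10, 60, 50, 30, 40, 20]     # illustrative values, mean 35
s = sample_sd(sample)                 # sqrt(1750 / 5) = sqrt(350)
```

The standard library's `statistics.stdev` uses the same n - 1 divisor and gives the same value.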
Sample Standard Deviation
(Shortcut Formula)
$$s = \sqrt{\frac{n \left(\sum x^2\right) - \left(\sum x\right)^2}{n (n - 1)}}$$
Formula 2-5
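The shortcut formula needs only the sum of the values and the sum of their squares. A sketch on the same kind of illustrative data:

```python
import math


def sample_sd_shortcut(data):
    """Sample standard deviation via n*sum(x^2) - (sum x)^2 over n*(n - 1)."""
    n = len(data)
    if n < 2:
        raise ValueError("at least two values are required")
    sum_x = sum(data)
    sum_x_squared = sum(x * x for x in data)
    return math.sqrt((n * sum_x_squared - sum_x ** 2) / (n * (n - 1)))


sample = [10, 60, 50, 30, 40, 20]          # illustrative values
# sum x = 210, sum x^2 = 9100, so (6 * 9100 - 210^2) / (6 * 5) = 10500 / 30 = 350
s_shortcut = sample_sd_shortcut(sample)
```

The shortcut is algebraically equivalent to the definitional formula, but in floating-point arithmetic it can lose precision when the values are large relative to their spread.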
Rationale for Formula
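Why is n – 1 used rather than n? Dividing by n tends to underestimate the
population variance; dividing by n – 1 corrects for this, making the sample
variance an unbiased estimator of the population variance.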
Standard Deviation from a
Frequency Distribution
Formula 2-6
$$s = \sqrt{\frac{n \left[\sum (f \cdot x^2)\right] - \left[\sum (f \cdot x)\right]^2}{n (n - 1)}}$$
Use the class midpoints as the x
values
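A sketch of the frequency-distribution formula, where each class is represented by its midpoint. The midpoints and frequencies below are illustrative assumptions, not data from the slides.

```python
import math


def grouped_sample_sd(midpoints, frequencies):
    """Sample standard deviation from class midpoints x and frequencies f."""
    n = sum(frequencies)                     # total number of values
    if n < 2:
        raise ValueError("at least two values are required")
    sum_fx = sum(f * x for x, f in zip(midpoints, frequencies))
    sum_fx2 = sum(f * x * x for x, f in zip(midpoints, frequencies))
    return math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))


# Hypothetical frequency distribution with seven classes.
midpoints = [8, 13, 18, 23, 28, 33, 38]
frequencies = [1, 2, 3, 5, 4, 3, 2]
# n = 20, sum(f*x) = 490, sum(f*x^2) = 13310
s_grouped = grouped_sample_sd(midpoints, frequencies)
```

Because every value in a class is treated as if it sat at the midpoint, the result is an approximation of the standard deviation of the raw data.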
Standard Deviation Key Points
• The standard deviation is a measure of variation of
all values from the mean.
• The value of the standard deviation s is never
negative; it is zero only when all the data values are equal.
• The value of the standard deviation s can increase
dramatically with the inclusion of one or more
outliers (data values far away from all others).
• The units of the standard deviation s are the same as
the units of the original data values.
Definition
• The variance of a set of values is a measure of
variation equal to the square of the standard
deviation.
• Sample variance: square of the sample standard
deviation s
• Population variance: square of the population
standard deviation σ
Variance Notation
Variance is the standard deviation squared.
Notation
s²: sample variance
σ²: population variance
Variance and Standard
Deviation
• Variances and standard deviations can be
used to determine the spread of the data. If
the variance or standard deviation is large,
the data are more dispersed. The
information is useful in comparing two or
more data sets to determine which is more
variable.
• The measures of variance and standard
deviation are used to determine the
Round-off Rule
for Measures of Variation
Carry one more decimal place
than is present in the original
set of data.
Round only the final answer, not values in
the middle of a calculation.
Coefficient of Variation
• The coefficient of variation is the standard
deviation divided by the mean. The result is
expressed as a percentage.
• The coefficient of variation is used to
compare standard deviations when the units
are different for the two variables being
compared.
Definition
The coefficient of variation (or CV) for a set of
sample or population data, expressed as a
percent, describes the standard deviation relative
to the mean.
Sample:
$$CV = \frac{s}{\bar{x}} \cdot 100\%$$
Population:
$$CV = \frac{\sigma}{\mu} \cdot 100\%$$
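A small sketch of how the coefficient of variation lets two variables with different units be compared. The summary statistics below are hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100


# Hypothetical summary statistics for two variables in different units.
cv_sales = coefficient_of_variation(sd=5, mean=87)              # units sold
cv_commissions = coefficient_of_variation(sd=773, mean=5225)    # dollars
more_variable = "commissions" if cv_commissions > cv_sales else "sales"
```

Although the raw standard deviations cannot be compared directly, the percentages can: here the commissions are relatively more variable than the sales.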
Variance and Standard
Deviation (cont’d.)
• The variance and standard deviation are
used to determine the number of data values
that fall within a specified interval in a
distribution.
• The variance and standard deviation are
used quite often in inferential statistics.
Chebyshev’s Theorem
• The proportion of values from a data set that
will fall within k standard deviations of the
mean will be at least 1 – 1/k², where k is a
number greater than 1.
• This theorem applies to any distribution
regardless of its shape.
Definition
Chebyshev’s Theorem
The proportion (or fraction) of any set of data lying
within K standard deviations of the mean is always at
least 1 – 1/K², where K is any positive number greater
than 1.
• For K = 2, at least 3/4 (or 75%) of all values lie
within 2 standard deviations of the mean
• For K = 3, at least 8/9 (or about 89%) of all values lie
within 3 standard deviations of the mean
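As a rough check of Chebyshev's theorem, the sketch below computes the guaranteed lower bound and compares it with the actual fraction of a deliberately skewed, made-up data set that lies within K standard deviations of the mean.

```python
import statistics


def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations (k > 1)."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2


def fraction_within(data, k):
    """Observed proportion of values within k population standard deviations."""
    mu = statistics.mean(data)
    sigma = statistics.pstdev(data)
    inside = [x for x in data if abs(x - mu) <= k * sigma]
    return len(inside) / len(data)


skewed = [1, 1, 1, 1, 2, 2, 3, 5, 8, 40]    # illustrative, strongly right skewed
bound_2 = chebyshev_bound(2)                # 0.75
observed_2 = fraction_within(skewed, 2)     # 9 of the 10 values
```

The bound is a guarantee, not an estimate: the observed proportion is usually well above it.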
Empirical Rule for Normal
Distributions
The following apply to a bell-shaped
distribution.
• Approximately 68% of the data values fall
within one standard deviation of the mean.
• Approximately 95% of the data values fall
within two standard deviations of the mean.
• Approximately 99.7% of the data values fall
within three standard
deviations of the mean.
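The three percentages can be checked against the normal distribution itself. For a normal variable, the proportion within k standard deviations of the mean equals erf(k / √2); the sketch below evaluates this with the standard library.

```python
import math


def normal_fraction_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


within_1 = normal_fraction_within(1)   # about 0.6827
within_2 = normal_fraction_within(2)   # about 0.9545
within_3 = normal_fraction_within(3)   # about 0.9973
```

The empirical rule rounds these to 68%, 95%, and 99.7%.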
The Empirical Rule
FIGURE 2-13
Recap
In this section we have looked at:
• Range
• Standard deviation of a sample and population
• Variance of a sample and population
• Coefficient of variation (CV)
• Standard deviation using a frequency distribution
• Empirical rule
• Chebyshev’s theorem
Measures
of Position
Standard Scores
• A standard score or z score is used when
direct comparison of raw scores is
impossible.
• A standard score or z score for a value is
obtained by subtracting the mean from the
value and dividing the result by the standard
deviation.
Definition
• z score (or standard score):
the number of standard
deviations that a given value x is
above or below the mean.
Measures of Position
z score
Sample:
$$z = \frac{x - \bar{x}}{s}$$
Population:
$$z = \frac{x - \mu}{\sigma}$$
Round to 2 decimal places
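A sketch of the z score used to compare raw scores from two different tests. The scores, means, and standard deviations are hypothetical.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return round((x - mean) / sd, 2)   # rounded to 2 decimal places


# Hypothetical results for one student on two tests with different scales.
z_test_a = z_score(65, mean=50, sd=10)   # 1.5 standard deviations above the mean
z_test_b = z_score(30, mean=25, sd=5)    # 1.0 standard deviation above the mean
z_below = z_score(40, mean=50, sd=10)    # negative: the value is below the mean
```

The raw scores 65 and 30 are not directly comparable, but the z scores show that the relative position on test A is higher.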
Interpreting Z Scores
FIGURE 2-14
Whenever a value is less than the mean, its
corresponding z score is negative.
Ordinary values:
z score between –2 and 2
Unusual values:
z score < –2 or z score > 2
Percentiles
• Percentiles are position measures used in
educational and health-related fields to indicate the
position of an individual in a group.
• A percentile, P, is an integer between 1 and 99
such that the Pth percentile is a value where P%
of the data values are less than or equal to the
value and (100 – P)% of the data values are greater
than or equal to the value.
Finding the Percentile
of a Given Score
$$\text{Percentile of value } x = \frac{(\text{number of values less than } x) + 0.5}{\text{total number of values}} \cdot 100$$
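The percentile formula can be sketched as follows. The data values are illustrative.

```python
def percentile_of(x, data):
    """Percentile rank of x: (values below x + 0.5) / total * 100."""
    if not data:
        raise ValueError("data must not be empty")
    below = sum(1 for value in data if value < x)
    return (below + 0.5) * 100 / len(data)


test_scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]   # illustrative values
rank_of_12 = percentile_of(12, test_scores)         # 6 values are below 12
```

A score of 12 in this data set corresponds to the 65th percentile.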
Converting from the
p Percentile to the
Corresponding Data Value
$$c = \frac{n \cdot p}{100}$$
Notation
n: total number of values in the data set
p: percentile being used
c: locator that gives the position of a value
Pk: kth percentile
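The slide gives only the locator c, so the sketch below assumes a common textbook convention for turning c into a data value: if c is not a whole number, round up and take the value in that position; if c is a whole number, average the values in positions c and c + 1. That rule and the data are assumptions, not stated in the slides.

```python
def value_at_percentile(data, p):
    """Return the data value for the p-th percentile (p is an integer, 1 to 99)."""
    if not 1 <= p <= 99:
        raise ValueError("p must be between 1 and 99")
    ordered = sorted(data)
    n = len(ordered)
    c_floor, remainder = divmod(n * p, 100)    # c = n * p / 100, kept exact
    if remainder != 0:
        # c is not whole: round up to position c_floor + 1 (1-based).
        return ordered[c_floor]
    if c_floor == 0:
        return ordered[0]
    # c is whole: average the values in positions c and c + 1 (1-based).
    return (ordered[c_floor - 1] + ordered[c_floor]) / 2


values = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]   # illustrative, already ordered
p25 = value_at_percentile(values, 25)   # c = 2.5, round up to position 3
p60 = value_at_percentile(values, 60)   # c = 6, average positions 6 and 7
```

Statistical software often interpolates instead, so results for small data sets can differ slightly from this rule.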
Quartiles and Deciles
• Quartiles divide the distribution into four
groups, denoted by Q1, Q2, Q3. Note that Q1
is the same as the 25th percentile; Q2 is the
same as the 50th percentile or the median;
and Q3 corresponds to the 75th percentile.
• Deciles divide the distribution into 10 groups.
They are denoted by D1, D2, …, D9.
Quartiles
Q1, Q2, Q3
divide ranked scores into four equal parts
(minimum) 25% | Q1 | 25% | Q2 (median) | 25% | Q3 | 25% (maximum)
Definition
• Q1 (First Quartile) separates the bottom
25% of sorted values from the top 75%.
• Q2 (Second Quartile) same as the median;
separates the bottom 50% of sorted
values from the top 50%.
• Q3 (Third Quartile) separates the bottom
75% of sorted values from the top 25%.
Percentiles
Just as there are quartiles separating data
into four parts, there are 99 percentiles
denoted P1, P2, . . . P99, which partition the
data into 100 groups.
Some Other Statistics
• Interquartile range (or IQR): Q3 – Q1
• Semi-interquartile range: (Q3 – Q1) / 2
• Midquartile: (Q3 + Q1) / 2
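These three statistics follow directly from Q1 and Q3. In the sketch below the quartile values are supplied by the caller and are hypothetical.

```python
def quartile_statistics(q1, q3):
    """Return (IQR, semi-interquartile range, midquartile) from Q1 and Q3."""
    iqr = q3 - q1
    return iqr, iqr / 2, (q3 + q1) / 2


# Hypothetical quartiles.
iqr, semi_iqr, midquartile = quartile_statistics(q1=5, q3=13)
```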
Recap
In this section we have discussed:
• z scores
• z scores and unusual values
• Quartiles
• Percentiles
• Converting a percentile to corresponding data
values
• Other statistics
Outliers
• An outlier is an extremely high or an extremely
low data value when compared with the rest of
the data values.
• Outliers can be the result of measurement or
observational error.
• When a distribution is normal or bell-shaped,
data values that are beyond three standard
deviations of the mean can be considered
suspected outliers.
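The three-standard-deviation guideline can be sketched as a simple filter. The function name and data are illustrative, and the sample standard deviation is used on the assumption that the data are a sample.

```python
import statistics


def suspected_outliers(data, k=3):
    """Return values more than k sample standard deviations from the mean."""
    mean = statistics.mean(data)
    sd = statistics.stdev(data)
    return [x for x in data if abs(x - mean) > k * sd]


# Illustrative data: twenty values near 50 and one extreme value.
readings = [50, 51, 49, 52, 48] * 4 + [90]
flagged = suspected_outliers(readings)
```

In very small samples no value can be more than three standard deviations from the mean, so this guideline is only useful for reasonably large, roughly bell-shaped data sets.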
Important Principles
• An outlier can have a dramatic effect on the
mean.
• An outlier can have a dramatic effect on the
standard deviation.
• An outlier can have a dramatic effect on the
scale of the histogram so that the true
nature of the distribution is totally
obscured.
Exploratory Data Analysis
• The purpose of exploratory data analysis is
to examine data in order to find out what
information can be discovered. For example:
– Are there any gaps in the data?
– Can any patterns be discerned?
Boxplots and Five-Number
Summaries
• Boxplots are graphical representations of a five-number
summary of a data set. The five specific
values that make up a five-number summary are:
– The lowest value of data set (minimum)
– Q1 (or 25th percentile)
– The median (or 50th percentile)
– Q3 (or 75th percentile)
– The highest value of data set (maximum)
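A sketch of the five-number summary. The slides do not say how Q1 and Q3 are computed, so this assumes one common textbook method: Q1 is the median of the values below the overall median and Q3 is the median of the values above it. The data are illustrative.

```python
import statistics


def five_number_summary(data):
    """Return (minimum, Q1, median, Q3, maximum) using medians of the two halves."""
    ordered = sorted(data)
    n = len(ordered)
    if n < 2:
        raise ValueError("at least two values are required")
    half = n // 2
    lower = ordered[:half]                  # values below the median position
    upper = ordered[half + (n % 2):]        # skip the middle value when n is odd
    return (
        ordered[0],
        statistics.median(lower),
        statistics.median(ordered),
        statistics.median(upper),
        ordered[-1],
    )


summary = five_number_summary([16, 2, 8, 4, 12, 6, 14, 10])
```

A boxplot draws the box from Q1 to Q3 with a line at the median and whiskers out to the minimum and maximum. Other quartile conventions, such as the default of `statistics.quantiles`, can give slightly different Q1 and Q3.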
Boxplots
Figure 2-16
Recap
In this section we have looked at:
• Exploratory data analysis
• Effects of outliers
• Five-number summary and boxplots
Summary
• Some basic ways to summarize data include
measures of central tendency, measures of
variation or dispersion, and measures of
position.
• The three most commonly used measures of
central tendency are the mean, median, and
mode. The midrange is also used to
represent an average.
Summary (cont’d.)
• The three most commonly used
measurements of variation are the range,
variance, and standard deviation.
• The most common measures of position are
percentiles, quartiles, and deciles.
• Data values are distributed according to
Chebyshev’s theorem and in special cases,
the empirical rule.
Summary (cont’d.)
• The coefficient of variation is used to describe
the standard deviation in relationship to the
mean.
• These methods are commonly called
traditional statistics.
• Other methods, such as the boxplot and five-number summary, are part of exploratory data
analysis; they are used to examine data to see
what they reveal.
Conclusions
• By combining all of these
techniques together, the
student is now able to
collect, organize,
summarize and present
data.
HOMEWORK
• Start at page 169, Review Exercises
• 2, 5, 8, 15, 20, 21, 22
• Data Analysis page 166
• 1, 2, 3 (Use the first 30 values of weight for
Databank)