Download z - El Camino College

+
Discovering Statistics
2nd Edition Daniel T. Larose
Chapter 3:
Describing Data Numerically
Lecture PowerPoint Slides
+ Chapter 3 Overview

3.1 Measures of Center
3.2 Measures of Variability
3.3 Working with Grouped Data
3.4 Measures of Position and Outliers
3.5 The Five-Number Summary and Boxplots
3.6 Chebyshev’s Rule and the Empirical Rule
+ The Big Picture
Where we are coming from and where we are headed…
• Chapter 2 showed us graphical and tabular summaries of data.
• In Chapter 3, we “crunch the numbers,” that is, develop numerical
summaries of data. We examine measures of center, measures of
variability, measures of position, and many other numerical summaries
of data.
• In Chapter 4, we will learn how to summarize the relationship between
two quantitative variables.
+ 3.1: Measures of Center
Objectives:
• Calculate the mean for a given data set.
• Find the median, and describe why the median is sometimes preferable
to the mean.
• Find the mode of a data set.
• Describe how skewness and symmetry affect these measures of center.
The Mean
The most well-known and widely used measure of center is the
mean. In everyday usage, the word average is often used for mean.
To find the mean of the values in a data set, simply add up all the numbers
and divide by how many numbers you have.
Notation:
•The sample size (how many observations in the data set) is always
denoted by n.
•The ith data value is denoted by x_i, where i is an index or counter
indicating which data point we are specifying.
•The notation for “add them together” is Σ (capital sigma), the Greek letter
“S,” because it stands for “Summation.”
•The sample mean is called $\bar{x}$ (pronounced “x-bar”).
The sample mean can be written as $\bar{x} = \frac{\sum x}{n}$. In plain English, this just
means that, in order to find the mean, we
1. Add up all the data values, giving us Σx
2. Divide by how many observations are in the data set, giving us $\bar{x} = \frac{\sum x}{n}$
The Population Mean
The mean value of the population is usually unknown. We denote
the population mean with µ (mu), which is the Greek letter “m.” The
population size is denoted by N.
When all the values of the population are known, the population
mean is calculated as
$\mu = \frac{\sum x}{N}$
We can use the sample mean as an estimate of µ. Note, however,
different samples may yield different sample means.
One drawback to using the mean to measure the center of the data is that
the mean is sensitive to the presence of extreme values in the data set.
The Median
The median of a data set is the middle data value when the data are put
into ascending order. Half of the data values lie below the median, and half
lie above.
•If the sample size n is odd, then the median is the middle value.
•If the sample size n is even, then the median is the mean of the two
middle data values.
Unlike the mean, the median is not sensitive to extreme values.
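A small sketch of this contrast, using made-up numbers that are not from the textbook: adding one extreme value moves the mean a long way but barely moves the median.

```python
from statistics import mean, median

# Hypothetical values, invented purely for illustration.
values = [3, 4, 5, 6, 7]
with_extreme = values + [100]   # same data plus one extreme value

mean_before, median_before = mean(values), median(values)
mean_after, median_after = mean(with_extreme), median(with_extreme)

print("without extreme value:", mean_before, median_before)
print("with extreme value:   ", round(mean_after, 2), median_after)
```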
The Mode
A third measure of center is called the mode. In a data set, the mode is
the value that occurs the most.
The mode of a data set is the data value that occurs with the greatest
frequency.
Rank  Person            Followers (millions)
1     Lady Gaga         6.6
2     Britney Spears    6.1
3     Ashton Kutcher    5.9
4     Justin Bieber     5.6
5     Ellen DeGeneres   5.3
6     Kim Kardashian    5.0
7     Taylor Swift      4.4
8     Oprah Winfrey     4.4
9     Katy Perry        4.2
10    John Mayer        3.7
Sample Mean
$\bar{x} = \frac{\sum x}{n} = \frac{6.6 + 6.1 + \dots + 4.2 + 3.7}{10} = \frac{51.2}{10} = 5.12$ million
Median
$\text{median} = \frac{5.0 + 5.3}{2} = 5.15$ million
Mode
Two people have 4.4 million followers. 4.4 million is the mode.
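The three measures of center for this example can be checked with Python's standard `statistics` module. This is only a sketch; the list name `followers` is introduced here for illustration, and its values are the ten tabulated follower counts in millions.

```python
from statistics import mean, median, mode

# Follower counts (millions) for the ten people in the table.
followers = [6.6, 6.1, 5.9, 5.6, 5.3, 5.0, 4.4, 4.4, 4.2, 3.7]

sample_mean = mean(followers)       # sum of the values divided by n
sample_median = median(followers)   # n is even, so the two middle values are averaged
sample_mode = mode(followers)       # most frequent value

print(round(sample_mean, 2), round(sample_median, 2), sample_mode)
```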
Skewness and Measures of Center
The skewness of a distribution can often tell us something about the
relative values of the mean, median, and mode.
How Skewness Affects the Mean and Median
•For a right-skewed distribution, the mean is larger than the median.
•For a left-skewed distribution, the median is larger than the mean.
•For a symmetric unimodal distribution, the mean, median, and mode
are fairly close to one another.
+ 3.2: Measures of Variability
Objectives:
• Understand and calculate the range of a data set.
• Explain in your own words what a deviation is.
• Calculate the variance and the standard deviation for a population
or a sample.
The Range
Section 3.1 introduced ways to find the center of a data set. Two data sets
can have exactly the same mean, median, and mode and yet be quite
different. We need measures that summarize the variation, or variability, of
the data.
Women’s Volleyball Team Heights (in)
Western Massachusetts Univ   Northern Connecticut Univ
60                           66
70                           67
70                           70
70                           70
75                           72
There are a variety of ways to measure how spread out a data set is. The
simplest measure is the range.
The range of a data set is the difference between the largest value and
the smallest value in the data set:
range = largest value – smallest value
rangeWMU = 75 – 60 = 15 inches
rangeNCU = 72 – 66 = 6 inches
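As a quick check of the volleyball example, the sketch below recomputes both ranges and also confirms that the two teams share the same mean, median, and mode even though their spreads differ. The helper name `data_range` is made up for this illustration.

```python
from statistics import mean, median, mode

wmu = [60, 70, 70, 70, 75]   # Western Massachusetts Univ heights (in)
ncu = [66, 67, 70, 70, 72]   # Northern Connecticut Univ heights (in)

def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)

for name, heights in (("WMU", wmu), ("NCU", ncu)):
    print(name, mean(heights), median(heights), mode(heights), data_range(heights))
```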
What is Deviation?
The range is simple to calculate, but has its drawbacks. It is quite sensitive
to extreme values and it completely ignores all of the values in the data set
other than the extremes. The standard deviation quantifies spread with
respect to the center and uses all available data values.
Deviation
A deviation for a given data value x is the difference between the data value
and the mean of the data set. For a sample, the deviation equals x – x-bar.
For a population, the deviation equals x – µ.
•If the data value is larger than the mean, the deviation will be positive.
•If the data value is smaller than the mean, the deviation will be negative.
•If the data value equals the mean, the deviation will be zero.
The deviation can roughly be thought of as the distance between a data
value and the mean, except that the deviation can be negative while distance
is always positive.
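To make the sign convention concrete, here is a minimal sketch that computes the deviations for the five Western Massachusetts heights from the range example. Deviations from the mean always sum to zero (up to floating-point rounding), which is one reason measures of spread are built from squared deviations.

```python
heights = [60, 70, 70, 70, 75]          # WMU heights (in) from the range example
x_bar = sum(heights) / len(heights)     # mean of the data set

deviations = [x - x_bar for x in heights]   # each data value minus the mean
squared = [d ** 2 for d in deviations]      # squared deviations are never negative

print(deviations)
print(sum(deviations))
print(sum(squared))
```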
The Variance and Standard Deviation
To compute the standard deviation and variance, we consider the squared
deviations. It is logical to build our measure of spread using the mean
squared deviation.
The population variance σ² is the mean of the squared deviations in the
population given by the formula
$\sigma^2 = \frac{\sum (x - \mu)^2}{N}$
The population standard deviation σ is the positive square root of the
population variance and is found by
$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
The population standard deviation σ represents a distance from the mean
that is representative for that data set.
The Sample Variance and Sample Standard Deviation
In the real world, we use the sample mean and sample standard
deviation to estimate the population parameters. The sample variance
also depends on the concept of the mean squared deviations. However,
we replace the denominator with n – 1 to better estimate the parameter.
The sample variance s² is approximately the mean of the squared
deviations in the sample given by the formula
$s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
The sample standard deviation s is the positive square root of the sample
variance and is found by
$s = \sqrt{s^2} = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$
The value of s may be interpreted as the typical difference between a data
value and the sample mean for a given data set.
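The population and sample definitions differ only in the denominator. A minimal sketch, treating the five WMU heights from the range example first as a whole population and then as a sample (an assumption made only to show both formulas on the same numbers); the function names are illustrative:

```python
import math

def population_variance(values):
    """Mean of the squared deviations: divide by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)

def sample_variance(values):
    """Sum of squared deviations divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)

heights = [60, 70, 70, 70, 75]

sigma_sq = population_variance(heights)
s_sq = sample_variance(heights)

print(sigma_sq, round(math.sqrt(sigma_sq), 2))   # variance and standard deviation, population form
print(s_sq, round(math.sqrt(s_sq), 2))           # variance and standard deviation, sample form
```

Dividing by n – 1 always gives a slightly larger value than dividing by n, and the gap shrinks as the sample grows.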
Computational Formulas
The following computational formulas simplify the calculations for variance
and standard deviation. They are equivalent to the definition formulas.
Computational Formulas for the Variance and Standard Deviation
Population Variance
$\sigma^2 = \frac{\sum x^2 - (\sum x)^2 / N}{N}$
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum x^2 - (\sum x)^2 / N}{N}}$
Sample Variance
$s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1}$
Sample Standard Deviation
$s = \sqrt{\frac{\sum x^2 - (\sum x)^2 / n}{n - 1}}$
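The equivalence of the computational and definition formulas can be spot-checked numerically. The sketch below applies the computational form to the five WMU heights from the range example and compares the results with `statistics.pvariance` and `statistics.variance`, which follow the definition formulas. The variable names are chosen for illustration.

```python
import math
from statistics import pvariance, variance

heights = [60, 70, 70, 70, 75]   # WMU heights (in) from the range example

n = len(heights)
sum_x = sum(heights)                     # sum of the data values
sum_x_sq = sum(x * x for x in heights)   # sum of the squared data values

numerator = sum_x_sq - sum_x ** 2 / n
pop_var = numerator / n           # population form divides by N
samp_var = numerator / (n - 1)    # sample form divides by n - 1

# Compare with the definition-based results from the standard library.
print(math.isclose(pop_var, pvariance(heights)))
print(math.isclose(samp_var, variance(heights)))
```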
Population Example
2 
 x 
x 2 
2
N
N
(357.3) 2
20,997.918

8
 629.9998438
 630.0
   2  629.9998438  25.1
The standard deviation of farmland for all
counties in Connecticut is almost 25,100
acres.
18
Sample Example
Suppose we take a sample of three counties.
$s^2 = \frac{\sum x^2 - (\sum x)^2 / n}{n - 1} = \frac{15{,}963.38 - (213.6)^2 / 3}{3 - 1} = 377.53$
$s = \sqrt{s^2} = \sqrt{377.53} \approx 19.4$
The standard deviation of farmland for this
sample of three counties in Connecticut is
almost 19,400 acres.
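The sample example needs only three summary numbers: the sum of the values, the sum of their squares, and n. A short sketch of that arithmetic, where the function name `sample_variance_from_sums` is hypothetical and the inputs are the figures shown in the worked example:

```python
import math

def sample_variance_from_sums(sum_x, sum_x_sq, n):
    """Computational formula: (sum of squares - (sum)^2 / n) / (n - 1)."""
    return (sum_x_sq - sum_x ** 2 / n) / (n - 1)

# Summary values as they appear in the three-county example.
s_sq = sample_variance_from_sums(sum_x=213.6, sum_x_sq=15_963.38, n=3)
s = math.sqrt(s_sq)

print(round(s_sq, 2), round(s, 1))
```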
+ 3.3: Working with Grouped Data
Objectives:
• Calculate weighted means.
• Estimate the mean for grouped data.
• Estimate the variance and standard deviation for grouped data.
The Weighted Mean
Sometimes not all the data values in a data set are of equal
importance. Certain data values may be assigned greater weight
than others when calculating the mean.
Weighted Mean
To find the weighted mean:
1. Multiply each data point x_i by its respective weight w_i.
2. Sum these products.
3. Divide the result by the sum of the weights:
$\bar{x}_w = \frac{\sum w_i x_i}{\sum w_i} = \frac{w_1 x_1 + w_2 x_2 + \dots + w_n x_n}{w_1 + w_2 + \dots + w_n}$
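The three steps translate directly into code. In this sketch the scores and weights are hypothetical (a made-up course in which an exam counts 5 parts, homework 3 parts, and a quiz 2 parts), and `weighted_mean` is an illustrative helper name.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    products = [w * x for x, w in zip(values, weights)]   # step 1
    return sum(products) / sum(weights)                   # steps 2 and 3

# Hypothetical data: exam, homework, quiz scores and their weights.
scores = [80, 90, 70]
weights = [5, 3, 2]

print(weighted_mean(scores, weights))   # (400 + 270 + 140) / 10
```

When every weight is equal, the weighted mean reduces to the ordinary mean.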
Estimating the Mean for Grouped Data
Data are often reported using frequency distributions. Without the
original data, we cannot calculate the exact values of the measures
of center and spread.
For each class in the frequency distribution, we estimate the class
mean using the class midpoint. The class midpoint is defined as
the mean of two adjoining lower class limits and is denoted m_i.
The product of the class frequency f_i and class midpoint m_i is used
as an estimate of the sum of the data values within that class.
Summing these products across all classes and dividing by the
total population size provides us with an estimated mean for data
grouped into a frequency distribution.
Calculate the estimated mean age of the adopted children in this table.
Σmifi = (0.5)(12) + (3.5)(611) + (8.5)(320)
+ (13.5)(161) + (17)(46)
= 6 + 2138.5 + 2720 + 2173.5 + 782
= 7820
N = Σfi = 12 + 611 + 320 + 161 + 46
= 1150
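Dividing the first sum by the second gives the estimated mean age, 7820 / 1150 = 6.8. The sketch below repeats the calculation; the midpoints and frequencies are read off the arithmetic above, and the variable names are illustrative.

```python
midpoints = [0.5, 3.5, 8.5, 13.5, 17]    # class midpoints m_i
frequencies = [12, 611, 320, 161, 46]    # class frequencies f_i

total = sum(m * f for m, f in zip(midpoints, frequencies))   # sum of m_i * f_i
n_total = sum(frequencies)                                   # N = sum of f_i
estimated_mean = total / n_total

print(total, n_total, round(estimated_mean, 2))
```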
Estimating the Variance and Standard Deviation for Grouped Data
We also use class midpoints and class frequencies to calculate the
estimated variance for data grouped into a frequency
distribution and the estimated standard deviation for data
grouped into a frequency distribution.
Estimated Variance and Standard Deviation for Data Grouped into a
Frequency Distribution
Given a frequency distribution with k classes, the estimated variance
and the estimated standard deviation for the variable are calculated
from the class midpoints m_i and the class frequencies f_i.
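The formulas themselves are not legible in this copy of the slides. A commonly used version, assumed here rather than taken from the slide, weights each squared deviation of a class midpoint from the estimated mean by its class frequency: $\hat{\sigma}^2 = \frac{\sum (m_i - \hat{\mu})^2 f_i}{N}$ and $\hat{\sigma} = \sqrt{\hat{\sigma}^2}$, where $\hat{\mu}$ is the estimated mean and N is the total frequency. The sketch applies that assumed formula to the adoption-age midpoints and frequencies from the estimated-mean example.

```python
import math

midpoints = [0.5, 3.5, 8.5, 13.5, 17]    # class midpoints m_i
frequencies = [12, 611, 320, 161, 46]    # class frequencies f_i

n_total = sum(frequencies)
est_mean = sum(m * f for m, f in zip(midpoints, frequencies)) / n_total

# Assumed formula: frequency-weighted squared deviations of the midpoints, divided by N.
est_var = sum(f * (m - est_mean) ** 2 for m, f in zip(midpoints, frequencies)) / n_total
est_sd = math.sqrt(est_var)

print(round(est_var, 2), round(est_sd, 2))
```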
+ 3.4: Measures of Position and Outliers
Objectives:
• Calculate z-scores and explain why we use them.
• Detect outliers using the z-score method.
• Find percentiles and percentile ranks for both small and large data sets.
• Compute quartiles and the interquartile range.
z-Scores
Our first measure of position is the z-score. The term z-score
indicates how many standard deviations a particular data value is
from the mean.
z-Score
The z-score for a particular data value from a sample is
$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{x - \bar{x}}{s}$
The z-score for a particular data value from a population is
$z = \frac{\text{data value} - \text{mean}}{\text{standard deviation}} = \frac{x - \mu}{\sigma}$
Suppose the mean score on the Math SAT is µ = 500, with a
standard deviation of σ = 100 points.
Jasmine’s Math SAT score is 650. What is her z-score?
$z = \frac{x - \mu}{\sigma} = \frac{650 - 500}{100} = 1.5$
In some cases, we may be given a z-score and asked to find its
associated data value x.
Given a z-score, to find its associated value x:
For a sample: $x = z\text{-score} \cdot s + \bar{x}$
For a population: $x = z\text{-score} \cdot \sigma + \mu$
where µ is the population mean, x-bar is the sample mean, σ is the
population standard deviation, and s is the sample standard
deviation.
z-scores can also be used to compare data from different data sets.
That is, relative positions can be compared even when the means
and standard deviations of the data sets are different.
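Both directions of the calculation are one-liners. A minimal sketch using the Math SAT figures (mean 500, standard deviation 100, score 650); the function names `z_score` and `value_from_z` are illustrative:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd

def value_from_z(z, mean, sd):
    """Invert the z-score formula to recover the data value."""
    return z * sd + mean

jasmine_z = z_score(650, mean=500, sd=100)
recovered = value_from_z(jasmine_z, mean=500, sd=100)

print(jasmine_z)   # 1.5
print(recovered)   # 650.0
```

The same functions work for a sample by passing x-bar and s in place of µ and σ.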
Detecting Outliers with z-Scores
An outlier is an extremely large or extremely small data value
relative to the rest of the data set. It may represent a data entry error,
or it may be genuine data.
Guidelines for Identifying Outliers
1. A data value whose z-score lies in the following range is not
considered to be unusual:
-2 < z-score < 2
2. A data value whose z-score lies in the following range may be
considered moderately unusual:
-3 < z-score ≤ -2 or 2 ≤ z-score < 3
3. A data value whose z-score lies in the following range may be
considered an outlier:
z-score ≤ -3 or z-score ≥ 3
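The three guidelines can be written as a small classifier. This is a sketch only; the function name and the label strings are made up for illustration, and the boundary cases follow the inequalities in the guidelines.

```python
def classify_by_z(z):
    """Label a data value from its z-score using the three guideline ranges."""
    if -2 < z < 2:
        return "not unusual"
    if -3 < z < 3:               # reached only when z <= -2 or z >= 2
        return "moderately unusual"
    return "outlier"             # z <= -3 or z >= 3

for z in (1.5, 2, -2.4, 3, -3.7):
    print(z, classify_by_z(z))
```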
Percentiles and Percentile Ranks
The next measure of position we consider is the percentile, which
shows the location of a data value relative to the other values in the
data set.
Percentile
Let p be any integer between 0 and 100. The pth percentile of a
data set is the data value at which p percent of the values in the
data set are less than or equal to the value.
Percentile Rank
The percentile rank of a data value x equals the percentage of
values in the data set that are less than or equal to x. In other
words:
$\text{percentile rank of data value } x = \frac{\text{number of values in data set} \le x}{\text{total number of values in data set}} \cdot 100$
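The percentile rank formula is a count followed by a division. The sketch below applies it to the twelve dance scores that appear in the quartile example later in this section; `percentile_rank` is an illustrative name.

```python
def percentile_rank(values, x):
    """Percentage of values in the data set that are less than or equal to x."""
    at_or_below = sum(1 for v in values if v <= x)
    return at_or_below / len(values) * 100

scores = [30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94]

rank_78 = percentile_rank(scores, 78)   # 8 of the 12 scores are <= 78
print(round(rank_78, 1))
```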
Quartiles
Just as the median divides the data set into halves, the quartiles are
the percentiles that divide the data set into quarters.
Quartiles
The quartiles of a data set divide the data
set into four parts, each containing 25% of
the data.
•The first quartile (Q1) is the 25th percentile.
•The second quartile (Q2) is the 50th
percentile.
•The third quartile (Q3) is the 75th percentile.
For small data sets, the division may be into
four parts of only approximately equal size.
Find the quartiles of the dance scores of the 12 students on page 129:
First, arrange them in order from smallest to largest:
30 44 56 62 65 68 75 78 81 85 89 94
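The worked steps for this example are not legible in this copy of the slides, so the sketch below uses one common textbook position rule as an assumption: compute i = (p/100)·n; if i is a whole number, average the values in positions i and i + 1, otherwise round i up and take the value in that position. Other software uses different interpolation rules and may give slightly different quartiles.

```python
import math

def percentile(sorted_values, p):
    """Return the pth percentile (0 < p < 100) of data already sorted ascending."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    n = len(sorted_values)
    position = p * n / 100            # 1-based position in the sorted list
    if position.is_integer():
        i = int(position)
        # Whole-number position: average that value and the next one.
        return (sorted_values[i - 1] + sorted_values[i]) / 2
    # Otherwise round the position up and take that value.
    return sorted_values[math.ceil(position) - 1]

scores = sorted([30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94])
q1, q2, q3 = (percentile(scores, p) for p in (25, 50, 75))

print(q1, q2, q3)
```

With these scores the rule gives Q1 = 59 and Q3 = 83, the values used in the IQR calculation for this data set.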
Interquartile Range
The variance and standard deviation are measures of spread that are
sensitive to the presence of extreme values. A more robust (less
sensitive) measure of variability is the interquartile range.
Interquartile Range
The interquartile range (IQR) is a robust measure of variability. It
is calculated as: IQR = Q3 – Q1.
The interquartile range is interpreted to be the spread of the middle
50% of the data.
For the dance scores, Q1 = 59 and Q3 = 83, so IQR = 83 – 59 = 24
+ 3.5: Five-Number Summary and Boxplots
Objectives:
• Calculate the five-number summary of a data set.
• Construct and interpret a boxplot for a given data set.
• Detect outliers using the IQR method.
The Five-Number Summary
One robust (or resistant) method of summarizing data that is used
widely is called the five-number summary. The set consists of five
measures we have already seen.
The five-number summary consists of the following set of
statistics, which together constitute a robust summarization of a
data set:
1. Minimum, the smallest value in the data set
2. First quartile, Q1
3. Median, Q2
4. Third quartile, Q3
5. Maximum, the largest value in the data set
For the dance scores: Min = 30, Max = 94
The Boxplot
The boxplot is a convenient graphical display of the five-number
summary of a data set.
Constructing a Boxplot by Hand
1. Determine the lower and upper fences:
Lower fence = Q1 – 1.5(IQR)
Upper fence = Q3 + 1.5(IQR)
2. Draw a horizontal number line that encompasses the range of
your data, including the fences. Draw vertical lines at Q1, the
median, and Q3. Connect the lines for Q1 and Q3 to form a box.
3. Temporarily indicate the fences with brackets [ and ].
4. Draw a horizontal line from Q1 to the smallest value greater than
the lower fence. Draw a horizontal line from Q3 to the largest value
smaller than the upper fence.
5. Indicate any data values smaller than the lower fence or larger
than the upper fence using an asterisk *.
For the dance scores:
Min = 30, Max = 94
IQR = 83 – 59 = 24
Lower fence = 59 – 1.5(24) = 23
Upper fence = 83 + 1.5(24) = 119
Boxplots for Skewed Data
Detecting Outliers with the IQR
The mean and standard deviation are sensitive to outliers. We can use
a more robust method of detecting outliers by using the IQR.
IQR Method to Detect Outliers
A data value is an outlier if
a. it is located 1.5(IQR) or more below Q1, or
b. it is located 1.5(IQR) or more above Q3.
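A sketch of the IQR method applied to the dance scores, taking Q1 = 59 and Q3 = 83 from the IQR calculation. The function names are illustrative, and the extra score of 120 in the second call is hypothetical, added only to show a value being flagged.

```python
def iqr_fences(q1, q3):
    """Lower and upper fences: Q1 - 1.5(IQR) and Q3 + 1.5(IQR)."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def iqr_outliers(values, q1, q3):
    """Values located 1.5(IQR) or more below Q1 or above Q3."""
    lower, upper = iqr_fences(q1, q3)
    return [v for v in values if v <= lower or v >= upper]

scores = [30, 44, 56, 62, 65, 68, 75, 78, 81, 85, 89, 94]

fences = iqr_fences(59, 83)
print(fences)                                 # (23.0, 119.0)
print(iqr_outliers(scores, 59, 83))           # no score reaches either fence
print(iqr_outliers(scores + [120], 59, 83))   # hypothetical extra score is flagged
```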
+ 3.6: Chebyshev’s Rule and the Empirical Rule
Objectives:
• Calculate percentages using Chebyshev’s Rule.
• Find percentages and data values using the Empirical Rule.
Chebyshev’s Rule
P.L. Chebyshev derived a result, called Chebyshev’s Rule, which can
be applied to any continuous data set whatsoever.
Chebyshev’s Rule
The proportion of values from a data set that will fall within k
standard deviations of the mean will be at least
$\left(1 - \frac{1}{k^2}\right) \cdot 100\%$
where k > 1. Chebyshev’s Rule may be applied to either samples
or populations. For example:
• k = 2. At least 3/4 (or 75%) of the data values will fall within 2
standard deviations of the mean.
• k = 3. At least 8/9 (or 88.89%) of the data values will fall within 3
standard deviations of the mean.
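The bound is easy to evaluate for any k > 1, including non-integer values. A minimal sketch, with `chebyshev_min_percent` as an illustrative function name:

```python
def chebyshev_min_percent(k):
    """Minimum percentage of data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's Rule requires k > 1")
    return (1 - 1 / k ** 2) * 100

for k in (1.5, 2, 3):
    print(k, round(chebyshev_min_percent(k), 2))
```

The result is a lower bound, not an estimate: the actual percentage within k standard deviations can be much higher.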
The Empirical Rule
When the data distribution is bell-shaped, the Empirical Rule
outperforms Chebyshev.
The Empirical Rule
If the data distribution is bell-shaped:
•About 68% of the data values will fall within 1 standard deviation of the
mean.
•About 95% of the data values will fall within 2 standard deviations of
the mean.
•About 99.7% of the data values will fall within 3 standard deviations of
the mean.
Stated in terms of z-scores:
•About 68% of the data values will have z-scores between -1 and 1.
•About 95% of the data values will have z-scores between -2 and 2.
•About 99.7% of the data values will have z-scores between -3 and 3.
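The 68, 95, and 99.7 figures are rounded values for an exactly normal distribution, which is an assumption stronger than "bell-shaped." Under that assumption the percentage within k standard deviations is erf(k/√2)·100, and the sketch below compares it with the Chebyshev lower bound for k = 2 and k = 3. The function name is illustrative.

```python
import math

def normal_percent_within(k):
    """Percent of an exactly normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2)) * 100

for k in (1, 2, 3):
    line = f"k={k}: normal {normal_percent_within(k):.2f}%"
    if k > 1:
        # Chebyshev's lower bound holds for any data set but is much looser.
        line += f", Chebyshev at least {(1 - 1 / k ** 2) * 100:.2f}%"
    print(line)
```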