STA120_Chapter3 – Students

CHAPTER 3
NUMERICAL DESCRIPTIVE MEASURES
MEASURES OF CENTRAL TENDENCY FOR UNGROUPED DATA



In Chapter 2, we used tables and
graphs to summarize a data set.
In Chapter 3, we will estimate
numerical summary measures
to identify important features of
a distribution.
We begin by focusing on
numerical summary measures
that identify the center and
spread of a distribution.
Measure of Central Tendency
A measure of central tendency tells us where the center of a histogram or a frequency distribution lies.
We will focus on three measures of central tendency:
• Mean
• Median
• Mode
Other measures include the trimmed mean, weighted mean, and geometric mean.
Mean or Arithmetic Mean
The mean, or arithmetic mean, for ungrouped or raw data is defined as the sum of all values divided by the number of values in the data set. So,
Mean for population data:
$$\mu = \frac{\sum x}{N}$$
Mean for sample data:
$$\bar{x} = \frac{\sum x}{n}$$
Example 1 – Sample Mean
The following table gives the standard deductions and personal exemptions for persons filing with “single” status on their 2009 state income taxes in a random sample of 9 states.

State          Standard Deduction   Personal Exemption
               (in dollars)         (in dollars)
Delaware       3250                 110
Hawaii         2000                 1040
Kentucky       2100                 20
Minnesota      5450                 3500
North Dakota   5450                 3500
Oregon         1865                 169
Rhode Island   5450                 3500
Vermont        5450                 3500
Virginia       3000                 930

Find the mean for the data on standard deduction.
Solution
$$\sum x = x_1 + x_2 + x_3 + x_4 + x_5 + x_6 + x_7 + x_8 + x_9$$
$$= 3250 + 2000 + 2100 + 5450 + 5450 + 1865 + 5450 + 5450 + 3000 = 34{,}015$$
$$\bar{x} = \frac{\sum x}{n} = \frac{34{,}015}{9} = \$3779.44$$
Thus, the mean 2009 standard deduction of these nine states was $3,779.44.
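The same calculation can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the values are the nine standard deductions from Example 1.

```python
# Standard deductions (in dollars) for the nine sampled states in Example 1
deductions = [3250, 2000, 2100, 5450, 5450, 1865, 5450, 5450, 3000]

total = sum(deductions)      # sum of all values (the numerator)
n = len(deductions)          # sample size (the denominator)
sample_mean = total / n      # x-bar = (sum of x) / n

print(total, n, round(sample_mean, 2))  # 34015 9 3779.44
```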
Example 2 – Population Mean
The following data set belongs to a population:
5   -7   2   0   -9   16   10   7
Find the mean.
Solution?
Example 3 – Effect of outliers on Mean
Find the mean for the data in Example 1 for personal exemption without the states of Minnesota, North Dakota, Rhode Island, and Vermont.
Now, find the mean for the data in Problem 3.11 for personal exemption: 110, 1040, 20, 3500, 3650, 176, 3650, 3650, 930
Thus, the contributions of the four states cause a more than fourfold increase in the value of the mean.
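A short Python sketch can make the "more than fourfold" claim concrete. It uses the nine personal exemption values listed for Problem 3.11 and assumes they appear in the same state order as the Example 1 table, so the four outlying states sit at positions 4, 5, 7, and 8. The variable names are illustrative.

```python
# Personal exemptions (in dollars) as listed for Problem 3.11
exemptions = [110, 1040, 20, 3500, 3650, 176, 3650, 3650, 930]

# Assumption: same state order as Example 1, so Minnesota, North Dakota,
# Rhode Island, and Vermont are at these zero-based positions.
outlier_positions = {3, 4, 6, 7}

without_outliers = [x for i, x in enumerate(exemptions)
                    if i not in outlier_positions]

mean_without = sum(without_outliers) / len(without_outliers)  # 2276 / 5
mean_all = sum(exemptions) / len(exemptions)                  # 16726 / 9

print(round(mean_without, 2), round(mean_all, 2))
print(round(mean_all / mean_without, 2))  # ratio of the two means
```

The ratio of the two means is slightly above 4, which matches the statement that the four states cause a more than fourfold increase.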
Mean or Arithmetic Mean - Summary
• Each value of the data set is used in the calculation.
• The population mean µ is constant, whereas the sample mean varies from sample to sample.
• Mean is not always the best measure of central tendency of a data set.
• Mean is greatly affected by outliers.
• When outliers exist in a data set, it is important to use trimmed mean or median.
• Trimmed mean is calculated by dropping a certain percentage of values from both ends of a ranked data set.
Median
The median is the value of the middle term in a data set that has been arranged in increasing order.
The steps for calculating the median are:
1. Arrange the data set in increasing order.
2. Find or locate the middle term. The value of this term is the median.
To locate the middle term and find the median:
1. For an odd number of observations, the location of the middle term is
$$\frac{(n \text{ or } N) + 1}{2}$$
Thus, median = value of the middle term.
2. For an even number of observations, the location of the middle term is based on two terms, one counted from the left and the other from the right of the data set. Thus,
$$\text{Median} = \frac{\text{Value of the } \left(\frac{n \text{ or } N}{2}\right)^{\text{th}} \text{ term from the left} + \text{Value of the } \left(\frac{n \text{ or } N}{2}\right)^{\text{th}} \text{ term from the right}}{2}$$
Example 4
Find the median for the data in Example 1 for standard deduction.
First, we rank the given data in increasing order as follows:
1865 2000 2100 3000 3250 5450 5450 5450 5450
Since there are nine states in this sample data set,
$$\text{Location of middle term} = \frac{n + 1}{2} = \frac{10}{2} = 5\text{th term}$$
The 5th term in the ranked data is 3250.
Thus, the median standard deduction is $3250.
Example 5
Find the median for: 258.7 77.8 393.1 427.0 273.6 2977.0
First, we rank the given data in increasing order as follows:
77.8 258.7 273.6 393.1 427.0 2977.0
Since there are six companies in this sample data set, the locations of the two middle terms are
$$\frac{n}{2} = \frac{6}{2} = 3\text{rd terms counting from the left and right.}$$
The 3rd term from the left is 273.6 and the 3rd term from the right is 393.1.
$$\text{Median} = \frac{273.6 + 393.1}{2} = 333.35$$
Thus, the median for the data set is 333.35.
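The two location rules for the median (odd and even number of observations) can be sketched in Python as follows. The function name `median_by_location` is hypothetical; it is written to mirror the steps above rather than to replace a library routine such as `statistics.median`.

```python
def median_by_location(values):
    """Median found by ranking the data and locating the middle term(s)."""
    ranked = sorted(values)          # step 1: arrange in increasing order
    n = len(ranked)
    if n % 2 == 1:
        # Odd count: the middle term is the ((n + 1) / 2)-th term (1-based)
        return ranked[(n + 1) // 2 - 1]
    # Even count: average the (n / 2)-th terms from the left and the right
    from_left = ranked[n // 2 - 1]
    from_right = ranked[-(n // 2)]
    return (from_left + from_right) / 2


# Example 4: nine standard deductions (odd count)
deductions = [3250, 2000, 2100, 5450, 5450, 1865, 5450, 5450, 3000]
median_deduction = median_by_location(deductions)

# Example 5: six values (even count)
example5 = [258.7, 77.8, 393.1, 427.0, 273.6, 2977.0]
median_example5 = median_by_location(example5)

print(median_deduction, round(median_example5, 2))
```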
Median - Summary
• Median gives the middle of a distribution, with half the data values to the left of the median and half to the right of the median.
• Median is not influenced by outliers.
• Median is preferred over the mean as a measure of central tendency for data sets that contain outliers.
Mode
Mode is defined as the value that occurs the most or with the highest frequency in a data set.
Example 6
Find the mode for the data in Example 1 for standard deduction.
Standard deductions (in dollars): 3250, 2000, 2100, 5450, 5450, 1865, 5450, 5450, 3000
In this data set, 5450 occurs four times while each of the remaining values occurs only once. 5450 is the mode because it has the highest frequency. Therefore,
Mode = $5450
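Counting frequencies is easy to sketch in Python with `collections.Counter`. The helper name `find_modes` is hypothetical, and it follows the convention in these notes that a data set in which every value occurs only once has no mode.

```python
from collections import Counter


def find_modes(values):
    """Return the list of values with the highest frequency (empty if none repeat)."""
    counts = Counter(values)             # value -> frequency
    highest = max(counts.values())
    if highest == 1:
        return []                        # every value occurs once: no mode
    return [value for value, freq in counts.items() if freq == highest]


deductions = [3250, 2000, 2100, 5450, 5450, 1865, 5450, 5450, 3000]
modes = find_modes(deductions)
frequency_of_5450 = Counter(deductions)[5450]

print(modes, frequency_of_5450)  # [5450] 4
```

Because the function returns a list, it can also report two or more modes when several values tie for the highest frequency.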
Mode - Summary
• Mode can be calculated for both qualitative and quantitative data sets.
• A data set may have no mode, one mode, or more than one mode.
• No mode = Data set where each value occurs only once.
• One mode = Data set where there is only one value with the highest frequency. This data set is called unimodal.
• Two modes = Data set where there are two values with the highest frequency. This data set is called bimodal.
• More than two modes = Data set where there are more than two values with the highest frequency. This data set is called multimodal.
Relationships among the Mean, Median, and Mode
1. For a symmetric histogram and frequency distribution curve:
mean = median = mode
2. For a right-skewed histogram and frequency distribution curve:
mode < median < mean
3. For a left-skewed histogram and frequency distribution curve:
mean < median < mode
MEASURES OF DISPERSION FOR UNGROUPED DATA
The mean, median, or mode does not tell us the spread, variation, or dispersion of a distribution.
For example: The numbers of car thefts that occurred in two neighboring cities over the past 12 days are given as:
City A: 6 4 7 11 4 3 9 7 2 7 9 15
City B: 8 10 14 0 0 10 20 0 15 3 3 1
• The data sets have the same mean, 7 cars per day.
• Without looking at the data, the mean alone suggests that the same number of cars were stolen per day for the past 12 days in both cities.
• A dotplot shows that the two cities have different variation.

We need a measure of dispersion or variation:
• Range
• Variance
• Standard deviation
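As a rough illustration of why a measure of dispersion is needed, the sketch below compares the two car-theft data sets. It uses Python's standard `statistics` module and treats each data set as a sample, which is an assumption made here for illustration.

```python
import statistics

city_a = [6, 4, 7, 11, 4, 3, 9, 7, 2, 7, 9, 15]
city_b = [8, 10, 14, 0, 0, 10, 20, 0, 15, 3, 3, 1]

# Both cities have the same mean number of thefts per day
mean_a = statistics.mean(city_a)
mean_b = statistics.mean(city_b)

# Range = largest value - smallest value
range_a = max(city_a) - min(city_a)
range_b = max(city_b) - min(city_b)

# Sample standard deviation (divides by n - 1)
sd_a = statistics.stdev(city_a)
sd_b = statistics.stdev(city_b)

print(mean_a, mean_b)                # equal means
print(range_a, range_b)              # different ranges
print(round(sd_a, 2), round(sd_b, 2))
```

The means agree, but both the range and the standard deviation are larger for City B, which is the difference a dotplot makes visible.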
Range for Ungrouped Data
Range = Largest value – Smallest value
Example 7
The following data give the number of pieces of junk mail received by 7 families during the past month.
41 33 28 21 29 19 2
a. Find the range with all the values in the data set.
b. Find the range without the value of 2.
Solution
a. Range = Largest value – Smallest value = 41 – 2 = 39 pieces of junk mail
b. Range = 41 – 19 = 22 pieces of junk mail
The range decreases from 39 to 22 pieces of junk mail just by dropping the outlier, 2. Therefore, the range is influenced by outliers.
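The effect of the outlier on the range can be reproduced with a minimal Python sketch. The helper name `data_range` is illustrative.

```python
def data_range(values):
    """Range = largest value - smallest value."""
    return max(values) - min(values)


junk_mail = [41, 33, 28, 21, 29, 19, 2]

range_all = data_range(junk_mail)                     # includes the outlier 2
without_outlier = [x for x in junk_mail if x != 2]    # drop the outlier
range_without = data_range(without_outlier)

print(range_all, range_without)  # 39 22
```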
Range - Summary
• Range is not a good measure of dispersion of a data set with outliers because its value is greatly affected by outliers.
• Range is also not a satisfactory measure of dispersion because it uses only two values, largest and smallest, in the data set.
Variance and Standard Deviation
• The standard deviation is the most used measure of dispersion because it tells how closely the values of a data set are clustered around the mean.
• Variance is denoted as (σ is the Greek letter sigma):
  σ² for population data
  s² for sample data
• Standard deviation is defined as the principal (positive) square root of the variance.
• Standard deviation is denoted as:
  σ for population data
  s for sample data
What does a value of the standard deviation mean?
• Lower value = Values are spread over a relatively smaller range around the mean
• Larger value = Values are spread over a relatively larger range around the mean
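Variance and Standard Deviation – Formula for Ungrouped Data
Basic Formula
Variance:
$$\sigma^2 = \frac{\sum (x - \mu)^2}{N} \qquad s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$$
Standard Deviation:
$$\sigma = \sqrt{\sigma^2} \qquad s = \sqrt{s^2}$$
Short-Cut Formula
Variance:
$$\sigma^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \qquad s^2 = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1}$$
Standard Deviation:
$$\sigma = \sqrt{\sigma^2} \qquad s = \sqrt{s^2}$$
Note
• $x - \mu$ or $x - \bar{x}$ indicates the deviation of each value of the data set from the mean.
• The sum of all the deviations must always be zero.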
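The sketch below checks, for one data set, that the basic and short-cut formulas for the sample variance agree and that the deviations from the mean sum to zero. The function names are hypothetical, and the junk-mail values from Example 7 are reused purely as illustrative sample data. For population data the divisor would be N instead of n − 1.

```python
import math


def sample_variance_basic(values):
    """s^2 = sum((x - x_bar)^2) / (n - 1)"""
    n = len(values)
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_shortcut(values):
    """s^2 = (sum(x^2) - (sum(x))^2 / n) / (n - 1)"""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return (sum_x2 - sum_x ** 2 / n) / (n - 1)


data = [41, 33, 28, 21, 29, 19, 2]   # illustrative sample (Example 7 values)

var_basic = sample_variance_basic(data)
var_shortcut = sample_variance_shortcut(data)
std_dev = math.sqrt(var_basic)       # s = sqrt(s^2)

# Deviations from the mean should sum to zero (up to rounding error)
x_bar = sum(data) / len(data)
sum_of_deviations = sum(x - x_bar for x in data)

print(round(var_basic, 4), round(var_shortcut, 4), round(std_dev, 4))
print(abs(sum_of_deviations) < 1e-9)
```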
Variance and Standard Deviation
Example 8 - Sample
Find the variance and standard
deviation for the sample data in
the given table.
Variance and Standard Deviation - Summary
• The values of the variance and the standard deviation cannot be negative. Why?
• The values of the variance and the standard deviation can be zero, if a data set has no variation.
• The measurement unit of variance is the square of the measurement unit of the original data.
• The measurement unit of standard deviation is the measurement unit of the original data. Why?
Population Parameters and Sample Statistics
A mean, median, mode, range, variance, or standard deviation calculated for:
• A population data set is called a population parameter, or just a parameter. µ and σ are examples of population parameters.
• A sample data set is called a sample statistic, or just a statistic. x̄ and s are examples of sample statistics.
MEAN, VARIANCE AND STANDARD DEVIATION
FOR GROUPED DATA
Skip
USE OF STANDARD DEVIATION
So far, we can find the mean and standard deviation of a distribution. The question is whether we can use the mean and standard deviation to find the percentage or proportion of a data set that lies within an interval around the mean.
The answer is yes, if we combine the mean and standard deviation.
To do this, we can use
• Chebyshev’s theorem or
• the empirical rule.
Our focus is only on the empirical rule.
Empirical Rule
• The empirical rule applies only to a bell-shaped distribution. That is, the empirical rule cannot be applied to other distributions such as left-skewed, right-skewed, and uniform distributions.
• For a bell-shaped distribution, the percentage or proportion of a data set that lies within an interval around the mean is determined by the following three rules:
  • Approximately 68% of the observations lie within one standard deviation of the mean.
  • Approximately 95% of the observations lie within two standard deviations of the mean.
  • Approximately 99.7% of the observations lie within three standard deviations of the mean.
Empirical Rule
Example 12a
Suppose that on a certain section of I-95 with a posted speed limit of 65 mph, the speeds of all vehicles have a bell-shaped distribution with a mean of 72 mph and a standard deviation of 3 mph. Using the empirical rule, find the percentage of vehicles with speeds of 63 to 81 mph on this section of I-95.
Solution
Empirical Rule
Example: The prices of all college textbooks follow a bell-shaped distribution with a mean of $105 and a standard deviation of $20.
A) Find the percentage of all college textbooks with their prices between $85 and $125.
Solution:

B) Find the percentage of all college textbooks with their prices between $65 and $145.

C) Find the interval that contains the prices of 99.7% of all college textbooks.
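One way to apply the empirical rule is to express each interval endpoint as a number of standard deviations from the mean and then look up the matching percentage. The sketch below does that. The names `EMPIRICAL_RULE`, `empirical_percentage`, and `empirical_interval` are hypothetical, and the function deliberately handles only symmetric intervals of exactly 1, 2, or 3 standard deviations, since those are the only cases the rule states.

```python
import math

# Approximate percentage of observations within k standard deviations
EMPIRICAL_RULE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_percentage(mean, sd, low, high):
    """Percentage within [low, high] for a bell-shaped distribution."""
    k_low = (mean - low) / sd      # distance below the mean, in sd units
    k_high = (high - mean) / sd    # distance above the mean, in sd units
    k = round(k_low)
    symmetric = math.isclose(k_low, k_high) and math.isclose(k_low, k)
    if not symmetric or k not in EMPIRICAL_RULE:
        raise ValueError("interval is not mean ± 1, 2, or 3 standard deviations")
    return EMPIRICAL_RULE[k]


def empirical_interval(mean, sd, k):
    """Interval mean ± k standard deviations."""
    return (mean - k * sd, mean + k * sd)


# Textbook prices: mean $105, standard deviation $20
part_a = empirical_percentage(105, 20, 85, 125)    # mean ± 1 sd
part_b = empirical_percentage(105, 20, 65, 145)    # mean ± 2 sd
part_c = empirical_interval(105, 20, 3)            # covers about 99.7%

# Example 12a: speeds with mean 72 mph, standard deviation 3 mph
speeds = empirical_percentage(72, 3, 63, 81)       # mean ± 3 sd

print(part_a, part_b, part_c, speeds)
```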
MEASURES OF POSITION
Definition
A measure of position determines the position of a single value in relation to other values in a sample or population.
We will discuss only the following measures of position:
• Quartiles and Interquartile Range
• Percentiles and Percentile Rank
Quartiles and Interquartile Range
Definition
Quartiles are three summary measures that divide a ranked data set into four equal parts.
• The first quartile is the value of the middle term among the observations that are less than the median.
• The second quartile is the same as the median of a data set.
• The third quartile is the value of the middle term among the observations that are greater than the median.
Calculating Interquartile Range
Interquartile range is the difference between the third and first quartiles. That is,
IQR = Interquartile range = Q3 – Q1
Example 13
The 2008 profits (rounded to billions of dollars) of 12 companies selected from all over the world are shown in the table.
a) Find the values of the three quartiles. Where do the 2008 profits of Merck & Co fall in relation to these quartiles?
b) Find the interquartile range.
Solution
a) By looking at the position of $8 billion, which is the 2008 profit of Merck & Co, we can state that this value lies in the bottom 25% of the profits for 2008.
b) IQR = Interquartile range = Q3 – Q1 = 15.5 – 9.5 = $6 billion
Percentiles and Percentile Rank
Percentiles are summary measures that divide a ranked data set into 100 equal parts. Each part contains 1% of the data set. Therefore, a data set has 99 percentiles, which are denoted by P1, P2, P3, …, P99.
P1 is the 1st percentile and is defined as a value in a ranked data set such that 1% of the values in the data set are smaller than the value of P1 and 99% of the values are greater than the value of P1.
P2 is the 2nd percentile and is defined as a value in a ranked data set such that 2% of the values in the data set are smaller than the value of P2 and 98% of the values are greater than the value of P2.
...
P44 is the 44th percentile and is defined as a value in a ranked data set such that 44% of the values in the data set are smaller than the value of P44 and 56% of the values are greater than the value of P44.
...
Pk is the kth percentile and is defined as a value in a ranked data set such that k% of the values in the data set are smaller than the value of Pk and (100 - k)% of the values are greater than the value of Pk.
Example:
A student scored 520 on the quantitative portion of the SAT examination. The student's score corresponds to the 68th percentile. Give a brief interpretation of the student's percentile.
Solution:
Calculation of Percentile
The approximate value of the kth percentile, denoted by Pk, is determined as
$$P_k = \text{Value of the } \left(\frac{kn}{100}\right)^{\text{th}} \text{ term in a ranked data set}$$
where
k = the number of the percentile.
n = the sample size.
Percentile Rank
The percentile rank of a value, $x_i$, in a data set is the percentage of values in the data set that are less than $x_i$. It is calculated as
$$\text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$$
Example 14
The following data give the numbers of computer keyboards assembled at the Twentieth Century Electronics Company for a sample of 25 days.
45 52 48 41 56 46 44 42 48 53 51 53 51
48 46 43 52 50 54 47 44 47 50 49 52
The data arranged in increasing order are as follows:
41 42 43 44 44 45 46 46 47 47 48 48 48
49 50 50 51 51 52 52 52 53 53 54 56
Determine the approximate value of the 53rd percentile.
$$\frac{kn}{100} = \frac{(53)(25)}{100} = 13.25 \approx 13\text{th term}$$
The 13th term in the ranked data is 48. Therefore,
P53 = 48 keyboards
Example 15
Find the percentile rank of 50 computer keyboards. Give a brief interpretation of this percentile rank.
The data arranged in increasing order are as follows:
41 42 43 44 44 45 46 46 47 47 48 48 48
49 50 50 51 51 52 52 52 53 53 54 56
In this data set, 14 of the 25 values are less than 50. Hence,
$$\text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100$$
$$\text{Percentile rank of } 50 = \frac{14}{25} \times 100 = 56\%$$
On about 56% of these 25 days, fewer than 50 computer keyboards were assembled. Hence, on 44% of these 25 days, 50 or more computer keyboards were assembled.
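Both percentile calculations for the keyboard data can be sketched in Python. The function names are hypothetical. The position kn/100 is rounded to the nearest term, as in Example 14; how to treat a position that falls exactly halfway between two terms is not specified in these notes, so Python's built-in `round` is used here as an assumption.

```python
keyboards = [45, 52, 48, 41, 56, 46, 44, 42, 48, 53, 51, 53, 51,
             48, 46, 43, 52, 50, 54, 47, 44, 47, 50, 49, 52]


def approx_percentile(values, k):
    """Approximate k-th percentile: value of the (k * n / 100)-th ranked term."""
    ranked = sorted(values)
    n = len(ranked)
    position = k * n / 100                    # e.g. 53 * 25 / 100 = 13.25
    term = min(max(round(position), 1), n)    # nearest term, kept within 1..n
    return ranked[term - 1]                   # convert 1-based term to index


def percentile_rank(values, x):
    """Percentage of values in the data set that are less than x."""
    less_than = sum(1 for v in values if v < x)
    return 100 * less_than / len(values)


p53 = approx_percentile(keyboards, 53)
rank_of_50 = percentile_rank(keyboards, 50)

print(p53, rank_of_50)  # 48 56.0
```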
BOX-AND-WHISKER PLOT
A box-and-whisker plot uses the following to graphically display data:
1. Median,
2. 1st quartile,
3. 3rd quartile, and
4. Smallest and largest values in the data set between the lower and upper inner fences.
Lower inner fence = 1.5(IQR) below Q1 = Q1 - 1.5(IQR)
Upper inner fence = 1.5(IQR) above Q3 = Q3 + 1.5(IQR)
Advantages of box-and-whisker plot
1. Visually displays the center, spread, and the skewness of a data set.
2. Clearly identifies outliers.
3. Helps to compare different distributions.
Box-and-Whisker Plot
Steps to Plot Box-and-Whisker Chart
1. Arrange the data set in increasing order
2. Calculate the following:
• Median, Q1, Q3, and
• IQR = Q3 - Q1
3. Determine the lower and upper inner fences
4. Determine the smallest and largest values within the lower and upper inner
fences.
5. Draw a horizontal number line and mark the line covering all the values in
the data set.
6. Above the number line, draw a box with
• The left side at Q1 and the right side at Q3 and
• A vertical line at the median (inside the box).
7. Identify the smallest and largest values within the lower and upper inner fences with short vertical lines above the number line. Then, draw two lines joining each vertical line to the box. These lines are called whiskers.
8. A value that falls outside either of the inner fences is called an outlier.
9. An outlier could be:
• Mild or
• Extreme
10. A mild outlier is a value that falls outside one of the inner fences but inside the corresponding outer fence.
11. An extreme outlier is a value that falls outside either of the outer fences.
12. Calculating outer fences:
• Lower outer fence = 3(IQR) below Q1 = Q1 - 3(IQR)
• Upper outer fence = 3(IQR) above Q3 = Q3 + 3(IQR)
Example 16
The following data are the incomes (in thousands of dollars) for a sample of 12 households.
75 69 84 112 74 104 81 90 94 144 79 98
Construct a box-and-whisker plot for these data.
Example 16
Is the outlier mild or extreme?
Calculating outer fences:
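The quantities needed for the box-and-whisker plot in Example 16 can be computed with a short Python sketch. All function names are hypothetical. Quartiles are found as described in these notes, as the medians of the lower and upper halves of the ranked data; other software may use slightly different quartile conventions.

```python
def median_of_ranked(ranked):
    """Median of a list that is already in increasing order."""
    n = len(ranked)
    if n % 2 == 1:
        return ranked[n // 2]
    return (ranked[n // 2 - 1] + ranked[n // 2]) / 2


def quartiles(values):
    """Q1, median, Q3 using the lower and upper halves of the ranked data."""
    ranked = sorted(values)
    half = len(ranked) // 2
    lower = ranked[:half]
    upper = ranked[-half:]        # skips the middle term when n is odd
    return median_of_ranked(lower), median_of_ranked(ranked), median_of_ranked(upper)


incomes = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]

q1, med, q3 = quartiles(incomes)
iqr = q3 - q1

inner = (q1 - 1.5 * iqr, q3 + 1.5 * iqr)   # lower and upper inner fences
outer = (q1 - 3 * iqr, q3 + 3 * iqr)       # lower and upper outer fences

# Whiskers end at the smallest and largest values within the inner fences
inside = [x for x in incomes if inner[0] <= x <= inner[1]]
whiskers = (min(inside), max(inside))


def classify(x):
    """Label a value as not an outlier, a mild outlier, or an extreme outlier."""
    if inner[0] <= x <= inner[1]:
        return "not an outlier"
    if outer[0] <= x <= outer[1]:
        return "mild outlier"
    return "extreme outlier"


outliers = {x: classify(x) for x in incomes if classify(x) != "not an outlier"}
print(q1, med, q3, iqr, inner, outer, whiskers, outliers)
```

Under these assumptions the value 144 lies above the upper inner fence but below the upper outer fence, so it is classified as a mild outlier.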