Statistics
Descriptive Statistics – Numerical Measures
Contents
Measures of location
Measures of variability
Measures of distribution shape, relative location, and detecting outliers
Exploratory data analysis
Measures of association between two variables
The weighted mean and working with grouped data
STATISTICS in PRACTICE


Small Fry Design is a toy and accessory
company that designs and imports products
for infants.
Cash flow management is one of the
most critical activities in the day-to-day operation of this company.
STATISTICS in PRACTICE


A critical factor in cash flow management is
the analysis and control of accounts receivable,
carried out by measuring the average age and dollar value
of outstanding invoices.
The company set the following goals: the
average age for outstanding invoices should not
exceed 45 days, and the dollar value of
invoices more than 60 days old should not
exceed 5% of the dollar value of all accounts
receivable.
Measures of Location



If the measures are computed for data from a
sample, they are called sample statistics.
If the measures are computed for data from a
population, they are called population
parameters.
A sample statistic is referred to as the point
estimator of the corresponding population
parameter.
Mean




The mean of a data set is the average of
all the data values.
Population mean: \(\mu\)
Sample mean: \(\bar{x}\)
The sample mean \(\bar{x}\) is the point estimator
of the population mean \(\mu\).
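Sample Mean \(\bar{x}\)
\[ \bar{x} = \frac{\sum x_i}{n} \]
where \(\sum x_i\) is the sum of the values of the n observations
and n is the number of observations in the sample.
Population Mean \(\mu\)
\[ \mu = \frac{\sum x_i}{N} \]
where \(\sum x_i\) is the sum of the values of the N observations
and N is the number of observations in the population.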
Sample Mean


Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
Data
Sample Mean

The mean monthly starting salary
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{2850 + 2950 + \cdots + 2880}{12} = \frac{35280}{12} = 2940 \]
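The sample mean above can be checked with a few lines of Python. This is a minimal sketch: the function name `sample_mean` is chosen here for illustration, and the salaries are entered in the ascending order used later in the median example.

```python
# Monthly starting salaries for the 12 business school graduates
# (the values from the example, listed in ascending order).
salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]


def sample_mean(values):
    """Return the sum of the observations divided by their number."""
    return sum(values) / len(values)


print(sum(salaries))          # 35280
print(sample_mean(salaries))  # 2940.0
```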
Sample Mean

Example: Apartment Rents
Seventy efficiency apartments
were randomly sampled in
a small college town. The
monthly rent prices for
these apartments are listed
in ascending order on the next
slide.
Sample Mean
425  430  430  435  435  435  435  435  440  440
440  440  440  445  445  445  445  445  450  450
450  450  450  450  450  460  460  460  465  465
465  470  470  472  475  475  475  480  480  480
480  485  490  490  490  500  500  500  500  510
510  515  525  525  525  535  549  550  570  570
575  575  580  590  600  600  600  600  615  615
Sample Mean
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{34,356}{70} = 490.80 \]
Median

The median of a data set is the value in the
middle when the data items are arranged in
ascending order.

Whenever a data set has extreme values, the
median is the preferred measure of central
location.
Median

The median is the measure of location most often
reported for annual income and property value
data.

A few extremely large incomes or property values
can inflate the mean.
Median


Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
We first arrange the data in ascending order.
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Because n = 12 is even, we identify the middle
two values: 2890 and 2920.
\[ \text{Median} = \frac{2890 + 2920}{2} = 2905 \]
Median

For an odd number of observations:
26 18 27 12 14 27 19
7 observations
12 14 18 19 26 27 27
in ascending order
the median is the middle value.
Median = 19
Median

For an even number of observations:
26 18 27 12 14 27 30 19
8 observations
12 14 18 19 26 27 27 30
in ascending order
the median is the average of the middle two values.
Median = (19 + 26)/2 = 22.5
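The odd and even cases can be sketched in Python as follows. The function name `median` is an illustrative choice; the data are the two small examples above plus the starting salaries.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the single middle value.
        return ordered[mid]
    # Even n: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median([26, 18, 27, 12, 14, 27, 19]))      # 19
print(median([26, 18, 27, 12, 14, 27, 30, 19]))  # 22.5
```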
Mode

Example: Frequency distribution of 50
soft drink purchases
Soft Drink      Frequency
Coke Classic       19
Diet Coke           8
Dr. Pepper          5
Pepsi-Cola         13
Sprite              5
Total              50
The mode, or most frequently purchased
soft drink, is Coke Classic.
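One way to find the mode from a frequency distribution is to pick the category with the largest count. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Frequency distribution of the 50 soft drink purchases.
purchases = Counter({
    "Coke Classic": 19,
    "Diet Coke": 8,
    "Dr. Pepper": 5,
    "Pepsi-Cola": 13,
    "Sprite": 5,
})

# most_common(1) returns a list with the single (item, count) pair of highest count.
mode, frequency = purchases.most_common(1)[0]
print(mode, frequency)  # Coke Classic 19
```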
Mode
450 occurred most frequently (7 times)
Mode = 450
Percentiles

A percentile provides information about how the
data are spread over the interval from the smallest
value to the largest value.

Admission test scores for colleges and
universities are frequently reported in terms
of percentiles.
Percentiles

The pth percentile of a data set is a value such
that at least p percent of the items take on this
value or less and at least (100 - p) percent of
the items take on this value or more.
Percentiles


Example: Monthly Starting Salaries for a
sample of 12 Business School Graduates
Let us determine the 85th percentile for the
starting salary data
Percentiles



Step 1. Arrange the data in ascending order.
2710 2755 2850 2880 2880 2890 2920 2940
2950 3050 3130 3325
Step 2. Compute the index i.
\[ i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right) 12 = 10.2 \]
Step 3.
Because i is not an integer, round up. The
position of the 85th percentile is the next
integer greater than 10.2, the 11th position.
The 85th percentile is the value in the 11th position, 3130.
Percentiles
Step 1. Arrange the data in ascending order.
Step 2. Compute index i, the position of the pth
percentile.
i = (p/100)n
Step 3. If i is not an integer, round up. The pth
percentile is the value in the position given by the rounded-up index.
If i is an integer, the pth percentile is the average
of the values in positions i and i + 1.
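The procedure above can be sketched as a small Python function. This is an illustration under stated assumptions: the name `percentile` is chosen here, p is taken to be a whole number strictly between 0 and 100, and integer arithmetic (`divmod`) is used so that deciding whether i is an integer does not depend on floating-point rounding. Statistical software often uses other interpolation rules, so results may differ from library defaults.

```python
def percentile(values, p):
    """pth percentile using the round-up / average-of-two rule.

    Assumes p is a whole number with 0 < p < 100.
    """
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    # i = (p/100) * n, split into whole part and remainder.
    whole, remainder = divmod(p * n, 100)
    if remainder:
        # i is not an integer: round up to position whole + 1 (index whole).
        return ordered[whole]
    # i is an integer: average the values in positions i and i + 1.
    return (ordered[whole - 1] + ordered[whole]) / 2


salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]
print(percentile(salaries, 85))  # 3130
print(percentile(salaries, 50))  # 2905.0
```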

90th Percentile
i = (p/100)n = (90/100)70 = 63
Averaging the 63rd and 64th data values:
90th Percentile = (580 + 590)/2 = 585
90th Percentile
“At least 90% of the items take on a value of 585 or less.”
63/70 = .9 or 90%
“At least 10% of the items take on a value of 585 or more.”
7/70 = .1 or 10%
Quartiles




Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Quartiles
Third Quartile
Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, which rounds up to 53
Third quartile = 525 (the value in the 53rd position)
Measures of Variability

It is often desirable to consider measures of
variability (dispersion), as well as measures
of location.

For example, in choosing supplier A or supplier
B we might consider not only the average
delivery time for each, but also the variability
in delivery time for each.
Measures of Variability





Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range



The range of a data set is the difference between
the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest
data values.
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
Interquartile Range



The interquartile range of a data set is the
difference between the third quartile and the
first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
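As a rough check of the range and interquartile range for the apartment rents, the sketch below rebuilds the 70 values from a rent-to-count mapping (a compact encoding chosen here, not part of the slides) and applies the percentile rule described earlier. The helper name `percentile` is an assumption for illustration.

```python
# Apartment rents encoded as {rent: number of apartments}; expands to 70 values.
RENT_COUNTS = {
    425: 1, 430: 2, 435: 5, 440: 5, 445: 5, 450: 7, 460: 3, 465: 3, 470: 2,
    472: 1, 475: 3, 480: 4, 485: 1, 490: 3, 500: 4, 510: 2, 515: 1, 525: 3,
    535: 1, 549: 1, 550: 1, 570: 2, 575: 2, 580: 1, 590: 1, 600: 4, 615: 2,
}
rents = sorted(r for r, count in RENT_COUNTS.items() for _ in range(count))


def percentile(values, p):
    """pth percentile (whole-number p, 0 < p < 100) using the round-up rule."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[whole]  # non-integer index: round up
    return (ordered[whole - 1] + ordered[whole]) / 2  # integer index: average


data_range = max(rents) - min(rents)
q1 = percentile(rents, 25)
q3 = percentile(rents, 75)
iqr = q3 - q1
print(data_range, q1, q3, iqr)  # 190 445 525 80
```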
Variance
The variance is a measure of variability that
utilizes all the data.
It is based on the difference between the value of
each observation (\(x_i\)) and the mean (\(\bar{x}\) for
a sample, \(\mu\) for a population).
Variance
The variance is the average of the squared
differences between each data value and the mean.
The variance is computed as follows:
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \quad \text{for a sample} \]
\[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \quad \text{for a population} \]
Standard Deviation


The standard deviation of a data set is the
positive square root of the variance.
It is measured in the same units as the data,
making it more easily interpreted than the
variance.
Standard Deviation
The standard deviation is computed as follows:
\[ s = \sqrt{s^2} \quad \text{for a sample} \]
\[ \sigma = \sqrt{\sigma^2} \quad \text{for a population} \]
Coefficient of Variation
The coefficient of variation indicates how large
the standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
\[ \left( \frac{s}{\bar{x}} \times 100 \right)\% \quad \text{for a sample} \]
\[ \left( \frac{\sigma}{\mu} \times 100 \right)\% \quad \text{for a population} \]
Variance, Standard Deviation,
And Coefficient of Variation

Variance
\[ s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2996.16 \]
Standard Deviation
\[ s = \sqrt{s^2} = \sqrt{2996.16} = 54.74 \]
Variance, Standard Deviation,
And Coefficient of Variation

Coefficient of Variation
\[ \left( \frac{s}{\bar{x}} \times 100 \right)\% = \left( \frac{54.74}{490.80} \times 100 \right)\% = 11.15\% \]
The standard deviation is about 11% of
the mean.
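The three measures can be reproduced from the apartment rent data with a short Python sketch. The rent-to-count mapping is a compact encoding of the 70 values chosen here for brevity, and the variable names are illustrative.

```python
import math

# Apartment rents encoded as {rent: number of apartments}; expands to 70 values.
RENT_COUNTS = {
    425: 1, 430: 2, 435: 5, 440: 5, 445: 5, 450: 7, 460: 3, 465: 3, 470: 2,
    472: 1, 475: 3, 480: 4, 485: 1, 490: 3, 500: 4, 510: 2, 515: 1, 525: 3,
    535: 1, 549: 1, 550: 1, 570: 2, 575: 2, 580: 1, 590: 1, 600: 4, 615: 2,
}
rents = [r for r, count in RENT_COUNTS.items() for _ in range(count)]

n = len(rents)
mean = sum(rents) / n
# Sample variance: squared deviations from the mean, divided by n - 1.
variance = sum((x - mean) ** 2 for x in rents) / (n - 1)
std_dev = math.sqrt(variance)
# Coefficient of variation: standard deviation as a percentage of the mean.
cv = std_dev / mean * 100

print(round(mean, 2), round(std_dev, 2), round(cv, 2))  # 490.8 54.74 11.15
```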
Measures of Distribution Shape,
Relative Location, and Detecting
Outliers





Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness



An important measure of the shape of a
distribution is called skewness.
Skewness is a measure of symmetry, or
more precisely, the lack of symmetry.
The formula for computing skewness for a
data set is somewhat complex.
Note: The formula for the skewness of
sample data is
\[ \text{skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left( \frac{x_i - \bar{x}}{s} \right)^3 \]
Distribution Shape: Skewness

Skewness can be easily computed using
statistical software.
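As an illustration of what such software computes, here is a minimal Python sketch of the sample skewness formula from the note above. The function name `sample_skewness` and the small test data sets are assumptions made for this example; they are not taken from the slides.

```python
import math


def sample_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least 3 observations are needed")
    mean = sum(values) / n
    # s is the sample standard deviation (divisor n - 1).
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


# Hypothetical data sets, used only to show the sign of the result.
print(sample_skewness([1, 2, 3, 4, 5]))       # symmetric data: approximately 0
print(sample_skewness([1, 1, 1, 2, 10]) > 0)  # long right tail: True
```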
Distribution Shape: Skewness



Symmetric (not skewed)
Skewness is zero.
Mean and median are equal.
Distribution Shape: Skewness
[Figure: relative frequency histogram of a symmetric distribution; skewness = 0]
Distribution Shape: Skewness

Moderately Skewed Left
Skewness is negative.

Mean will usually be less than the median.

Distribution Shape: Skewness
[Figure: relative frequency histogram of a distribution moderately skewed left; skewness = -.31]
Distribution Shape: Skewness


Moderately Skewed Right
Skewness is positive.
Mean will usually be more than the median.
[Figure: relative frequency histogram of a distribution moderately skewed right; skewness = .31]
Distribution Shape: Skewness



Highly Skewed Right
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
Distribution Shape: Skewness
[Figure: relative frequency histogram of a distribution highly skewed right; skewness = 1.25]
Distribution Shape: Skewness
Example: Apartment Rents
[Figure: relative frequency histogram of the monthly rents for the 70 sampled efficiency apartments; skewness = .92]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a
data value \(x_i\) is from the mean.
\[ z_i = \frac{x_i - \bar{x}}{s} \]
z-Scores




An observation’s z-score is a measure of the
relative location of the observation in a data
set.
A data value less than the sample mean will
have a z-score less than zero.
A data value greater than the sample mean
will have a z-score greater than zero.
A data value equal to the sample mean will
have a z-score of zero.
z-Scores

z-Score of Smallest Value (425)
\[ z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20 \]
Standardized Values for Apartment Rents
-1.20  -1.11  -1.11  -1.02  -1.02  -1.02  -1.02  -1.02  -0.93  -0.93
-0.93  -0.93  -0.93  -0.84  -0.84  -0.84  -0.84  -0.84  -0.75  -0.75
-0.75  -0.75  -0.75  -0.75  -0.75  -0.56  -0.56  -0.56  -0.47  -0.47
-0.47  -0.38  -0.38  -0.34  -0.29  -0.29  -0.29  -0.20  -0.20  -0.20
-0.20  -0.11  -0.01  -0.01  -0.01   0.17   0.17   0.17   0.17   0.35
 0.35   0.44   0.62   0.62   0.62   0.81   1.06   1.08   1.45   1.45
 1.54   1.54   1.63   1.81   1.99   1.99   1.99   1.99   2.27   2.27
Chebyshev’s Theorem

At least \((1 - 1/z^2)\) of the items in any data set will
be within z standard deviations of the mean,
where z is any value greater than 1.
Chebyshev’s Theorem

At least 75% of the data values must be
within z = 2 standard deviations of the mean.

At least 89% of the data values must be
within z = 3 standard deviations of the mean.

At least 94% of the data values must be
within z = 4 standard deviations of the mean.
Chebyshev’s Theorem
For example:
Let z = 1.5 with \(\bar{x}\) = 490.80 and s = 54.74.
At least \((1 - 1/(1.5)^2)\) = 1 - 0.44 = 0.56 or 56%
of the rent values must be between
\(\bar{x}\) - z(s) = 490.80 - 1.5(54.74) = 409
and
\(\bar{x}\) + z(s) = 490.80 + 1.5(54.74) = 573.
(Actually, 86% of the rent values are between
409 and 573.)
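The bound and the interval in this example can be sketched in Python. The function name `chebyshev_bound` is an illustrative choice; the mean and standard deviation are the rounded values quoted in the example.

```python
def chebyshev_bound(z):
    """Minimum proportion of items within z standard deviations (requires z > 1)."""
    if z <= 1:
        raise ValueError("z must be greater than 1")
    return 1 - 1 / z ** 2


x_bar, s = 490.80, 54.74  # sample mean and standard deviation of the rents
z = 1.5
lower = x_bar - z * s
upper = x_bar + z * s

print(round(chebyshev_bound(z), 2))  # 0.56
print(round(lower), round(upper))    # 409 573
```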
Empirical Rule
For data having a bell-shaped distribution:
68.27% of the values of a normal random variable
are within +/- 1 standard deviation
of its mean.
95.45% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.73% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
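These percentages can be reproduced numerically. The sketch below relies on a standard identity that is not stated in the slides: for a normal random variable, the probability of falling within k standard deviations of the mean equals erf(k/√2). The function name `within_k_sd` is an assumption for illustration.

```python
import math


def within_k_sd(k):
    """Probability that a normal random variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    # Print each probability as a percentage with two decimals.
    print(k, f"{within_k_sd(k):.2%}")
```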
Empirical Rule
[Figure: normal curve over x, with bands marked at \(\mu \pm 1\sigma\), \(\mu \pm 2\sigma\), and \(\mu \pm 3\sigma\)]
Detecting Outliers

An outlier is an unusually small or unusually
large value in a data set.

A data value with a z-score less than -3 or
greater than +3 might be considered an outlier.
Detecting Outliers
It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included
in the data set
• a correctly recorded data value that belongs
in the data set
Detecting Outliers

The most extreme z-scores are -1.20 and 2.27

Using |z| > 3 as the criterion for an outlier,
there are no outliers in this data set.
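The z-score screen for outliers can be sketched in Python as follows. The rent-to-count mapping is a compact encoding of the 70 rents chosen here, and the variable names are illustrative.

```python
import math

# Apartment rents encoded as {rent: number of apartments}; expands to 70 values.
RENT_COUNTS = {
    425: 1, 430: 2, 435: 5, 440: 5, 445: 5, 450: 7, 460: 3, 465: 3, 470: 2,
    472: 1, 475: 3, 480: 4, 485: 1, 490: 3, 500: 4, 510: 2, 515: 1, 525: 3,
    535: 1, 549: 1, 550: 1, 570: 2, 575: 2, 580: 1, 590: 1, 600: 4, 615: 2,
}
rents = sorted(r for r, count in RENT_COUNTS.items() for _ in range(count))

n = len(rents)
mean = sum(rents) / n
s = math.sqrt(sum((x - mean) ** 2 for x in rents) / (n - 1))

# z-score: number of standard deviations each value lies from the mean.
z_scores = [(x - mean) / s for x in rents]
# Flag values whose z-score is below -3 or above +3.
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(round(min(z_scores), 2), round(max(z_scores), 2))  # -1.2 2.27
print(outliers)                                          # []
```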
Exploratory Data Analysis


Five-Number Summary
Box Plot
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary

Example: Monthly Starting Salaries for a
sample of 12 Business School Graduates

Five-Number Summary
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Q1 = 2865
Q2 = 2905 (Median)
Q3 = 3000
Five-number summary: 2710, 2865, 2905, 3000, 3325
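A five-number summary for the starting salaries can be sketched in Python by combining the smallest and largest values with the three quartiles. The helper names `percentile` and `five_number_summary` are assumptions for illustration, and the quartiles follow the round-up / average rule from the percentile procedure (whole-number p only).

```python
def percentile(values, p):
    """pth percentile (whole-number p, 0 < p < 100) using the round-up rule."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder:
        return ordered[whole]  # non-integer index: round up
    return (ordered[whole - 1] + ordered[whole]) / 2  # integer index: average


def five_number_summary(values):
    """Return (smallest, Q1, median, Q3, largest)."""
    return (min(values), percentile(values, 25), percentile(values, 50),
            percentile(values, 75), max(values))


salaries = [2710, 2755, 2850, 2880, 2880, 2890,
            2920, 2940, 2950, 3050, 3130, 3325]
print(five_number_summary(salaries))  # (2710, 2865.0, 2905.0, 3000.0, 3325)
```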
Five-Number Summary
Lowest Value = 425 First Quartile = 445
Median = 475
Third Quartile = 525 Largest Value = 615
Box Plot

A box is drawn with its ends located at
the first and third quartiles.

A vertical line is drawn in the box at the
location of the median (second quartile).
Box Plot
[Figure: box plot on an axis from 375 to 625, with the box drawn from Q1 = 445 to Q3 = 525 and a vertical line at Q2 = 475]
Box Plot



Limits are located (not drawn) using the
interquartile range (IQR).
Data outside these limits are considered
outliers.
The location of each outlier is shown with
the symbol * .
Box Plot

The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80)
= 325
The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80)
= 645

There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
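The limit calculation can be sketched as a small Python function that takes the first and third quartiles and derives the IQR from them. The function name `box_plot_limits` is an illustrative assumption; the quartiles and extreme values are those reported for the apartment rents.

```python
def box_plot_limits(q1, q3):
    """Return (lower limit, upper limit), located 1.5 * IQR outside the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


q1, q3 = 445, 525            # quartiles of the apartment rents
smallest, largest = 425, 615  # extreme rent values

lower, upper = box_plot_limits(q1, q3)
# An outlier exists only if an extreme value falls outside the limits.
has_outliers = smallest < lower or largest > upper
print(lower, upper, has_outliers)
```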
Box Plot

Whiskers (dashed lines) are drawn from the
ends of the box to the smallest and largest
data values inside the limits.
[Figure: box plot on an axis from 375 to 625, with whiskers extending to the smallest value inside the limits (425) and the largest value inside the limits (615)]
Box Plot


Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
Box Plot
Measures of Association
Between Two Variables


Covariance
Correlation Coefficient
Covariance

The covariance is a measure of the linear
association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \quad \text{for samples} \]
\[ \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \quad \text{for populations} \]
Covariance


Example: Sample Data for the Stereo and
Sound Equipment Store
Data
Covariance

Scatter Diagram for the Stereo and Sound
Equipment Store

Sample Covariance
\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11 \]
Covariance

Partitioned Scatter Diagram for the Stereo
and Sound Equipment Store
Correlation Coefficient

The coefficient can take on values between
-1 and +1.

Values near -1 indicate a strong negative linear
relationship.

Values near +1 indicate a strong positive linear
relationship.
Correlation Coefficient

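The correlation coefficient is computed as follows:
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \quad \text{for samples} \]
\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \quad \text{for populations} \]
where
\[ s_x = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \qquad s_y = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} \]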
Correlation Coefficient

Correlation is a measure of linear association
and not necessarily causation.

Just because two variables are highly correlated,
it does not mean that one variable is the cause of
the other.
Covariance and Correlation
Coefficient
A golfer is interested in investigating
the relationship, if any, between driving
distance and 18-hole score.
Average Driving Distance (yds.)   Average 18-Hole Score
277.6                             69
259.5                             71
269.1                             70
267.0                             70
255.6                             71
272.9                             69
Covariance and Correlation
Coefficient
x          y        (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
277.6      69        10.65       -1.0        -10.65
259.5      71        -7.45        1.0         -7.45
269.1      70         2.15        0            0
267.0      70         0.05        0            0
255.6      71       -11.35        1.0        -11.35
272.9      69         5.95       -1.0         -5.95
Average    267.0     70.0
Std. Dev.  8.2192    .8944
Total                                        -35.40
Covariance and Correlation
Coefficient

Sample Covariance
\[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08 \]
Sample Correlation Coefficient
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631 \]
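The golfer example can be reproduced with a short Python sketch of the sample formulas. The function names `sample_covariance` and `sample_std` are chosen here for illustration.

```python
import math

distance = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]  # average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # average 18-hole score


def sample_std(values):
    """Sample standard deviation (divisor n - 1)."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))


def sample_covariance(x, y):
    """Sum of products of paired deviations from the means, divided by n - 1."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)


s_xy = sample_covariance(distance, score)
r_xy = s_xy / (sample_std(distance) * sample_std(score))
print(round(s_xy, 2), round(r_xy, 4))  # -7.08 -0.9631
```

The negative sign indicates that longer average drives go with lower scores in this sample.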
The Weighted Mean and
Working with Grouped Data




Weighted Mean
Mean for Grouped Data
Variance for Grouped Data
Standard Deviation for Grouped Data
Weighted Mean



When the mean is computed by giving each
data value a weight that reflects its importance,
it is referred to as a weighted mean.
In the computation of a grade point average
(GPA), the weights are the number of credit
hours earned for each grade.
When data values vary in importance, the
analyst must choose the weight that best
reflects the importance of each value.
Weighted Mean
\[ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \]
where:
\(x_i\) = value of observation i
\(w_i\) = weight for observation i
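A weighted mean such as a GPA can be sketched in Python as follows. The grade points and credit hours below are hypothetical values invented for this illustration, and the function name `weighted_mean` is an assumption.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical transcript: grade points earned and credit hours per course.
grade_points = [4.0, 3.0, 2.0]
credit_hours = [3, 4, 3]

print(weighted_mean(grade_points, credit_hours))  # 3.0
```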
Grouped Data


The weighted mean computation can be
used to obtain approximations of the mean,
variance, and standard deviation for the
grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the
mean of all items in the class.
Grouped Data

We compute a weighted mean of the class
midpoints using the class frequencies as
weights.

Similarly, in computing the variance and
standard deviation, the class frequencies are
used as weights.
Mean for Grouped Data

Sample Data
\[ \bar{x} = \frac{\sum f_i M_i}{n} \]
Population Data
\[ \mu = \frac{\sum f_i M_i}{N} \]
where:
\(f_i\) = frequency of class i
\(M_i\) = midpoint of class i
Sample Mean for Grouped
Data
Given below is the previous sample of
monthly rents for 70 efficiency apartments,
presented here as grouped data in the form of a
frequency distribution.
Rent ($)    Frequency
420-439         8
440-459        17
460-479        12
480-499         8
500-519         7
520-539         4
540-559         2
560-579         4
580-599         2
600-619         6
Sample Mean for Grouped
Data
Rent ($)    f_i     M_i      f_i M_i
420-439       8    429.5     3436.0
440-459      17    449.5     7641.5
460-479      12    469.5     5634.0
480-499       8    489.5     3916.0
500-519       7    509.5     3566.5
520-539       4    529.5     2118.0
540-559       2    549.5     1099.0
560-579       4    569.5     2278.0
580-599       2    589.5     1179.0
600-619       6    609.5     3657.0
Total        70             34,525.0
\[ \bar{x} = \frac{34,525}{70} = 493.21 \]
This approximation
differs by $2.41
from the actual
sample mean of
$490.80.
Variance for Grouped Data

For sample data
\[ s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1} \]
For population data
\[ \sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N} \]
Sample Variance for
Grouped Data
Rent ($)    f_i    M_i - x̄    (M_i - x̄)^2    f_i (M_i - x̄)^2
420-439       8     -63.7       4058.96         32471.71
440-459      17     -43.7       1910.56         32479.59
460-479      12     -23.7        562.16          6745.97
480-499       8      -3.7         13.76           110.11
500-519       7      16.3        265.36          1857.55
520-539       4      36.3       1316.96          5267.86
540-559       2      56.3       3168.56          6337.13
560-579       4      76.3       5820.16         23280.66
580-599       2      96.3       9271.76         18543.53
600-619       6     116.3      13523.36         81140.18
Total        70                                208234.29
Sample Variance for
Grouped Data

Sample Variance
s2 = 208,234.29/(70 – 1) = 3,017.89

Sample Standard Deviation
\[ s = \sqrt{3,017.89} = 54.94 \]
This approximation differs by only $.20
from the actual standard deviation of $54.74.
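The grouped-data approximations can be sketched in Python by treating each class midpoint as the value of every item in its class. The tuple layout `(lower limit, upper limit, frequency)` and the variable names are choices made for this illustration.

```python
import math

# (lower class limit, upper class limit, frequency) for the rent distribution.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8), (500, 519, 7),
    (520, 539, 4), (540, 559, 2), (560, 579, 4), (580, 599, 2), (600, 619, 6),
]

midpoints = [(low + high) / 2 for low, high, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

# Weighted mean of the midpoints, with class frequencies as weights.
mean = sum(f * m for f, m in zip(freqs, midpoints)) / n
# Sample variance for grouped data: frequency-weighted squared deviations over n - 1.
variance = sum(f * (m - mean) ** 2 for f, m in zip(freqs, midpoints)) / (n - 1)
std_dev = math.sqrt(variance)

print(round(mean, 2), round(std_dev, 2))  # 493.21 54.94
```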