Download ppt3b

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Bootstrapping (statistics) wikipedia , lookup

Time series wikipedia , lookup

Transcript
Slides by
John
Loucks
St. Edward’s
University
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 1
Chapter 3, Part B
Descriptive Statistics: Numerical Measures

Measures of Distribution Shape, Relative Location,
and Detecting Outliers

Exploratory Data Analysis

Measures of Association Between Two Variables

The Weighted Mean and
Working with Grouped Data
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 2
Measures of Distribution Shape,
Relative Location, and Detecting Outliers





Distribution Shape
z-Scores
Chebyshev’s Theorem
Empirical Rule
Detecting Outliers
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 3
Distribution Shape: Skewness

An important measure of the shape of a distribution
is called skewness.

The formula for the skewness of sample data is
n
 xi  x 
Skewness 



(n  1)(n  2)  s 

3
Skewness can be easily computed using statistical
software.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 4
Distribution Shape: Skewness

Symmetric (not skewed)
• Skewness is zero.
• Mean and median are equal.
Relative Frequency
.35
Skewness = 0
.30
.25
.20
.15
.10
.05
0
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 5
Distribution Shape: Skewness

Moderately Skewed Left
• Skewness is negative.
• Mean will usually be less than the median.
Relative Frequency
.35
Skewness =  .31
.30
.25
.20
.15
.10
.05
0
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 6
Distribution Shape: Skewness

Moderately Skewed Right
• Skewness is positive.
• Mean will usually be more than the median.
Relative Frequency
.35
Skewness = .31
.30
.25
.20
.15
.10
.05
0
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 7
Distribution Shape: Skewness

Highly Skewed Right
• Skewness is positive (often above 1.0).
• Mean will usually be more than the median.
Relative Frequency
.35
Skewness = 1.25
.30
.25
.20
.15
.10
.05
0
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 8
Distribution Shape: Skewness

Example: Apartment Rents
Seventy efficiency apartments were randomly
sampled in a college town. The monthly rent prices
for the apartments are listed below in ascending order.
425
440
450
465
480
510
575
430
440
450
470
485
515
575
430
440
450
470
490
525
580
435
445
450
472
490
525
590
435
445
450
475
490
525
600
435
445
460
475
500
535
600
435
445
460
475
500
549
600
435
445
460
480
500
550
600
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
440
450
465
480
500
570
615
440
450
465
480
510
570
615
Slide 9
Distribution Shape: Skewness

Example: Apartment Rents
Relative Frequency
.35
Skewness = .92
.30
.25
.20
.15
.10
.05
0
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 10
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a data
value xi is from the mean.
xi  x
zi 
s
Excel’s STANDARDIZE function can be used to
compute the z-score.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 11
z-Scores
 An observation’s z-score is a measure of the relative
location of the observation in a data set.
 A data value less than the sample mean will have a
z-score less than zero.
 A data value greater than the sample mean will have
a z-score greater than zero.
 A data value equal to the sample mean will have a
z-score of zero.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 12
z-Scores
 Example: Apartment Rents
• z-Score of Smallest Value (425)
xi  x 425  490.80
z

  1.20
s
54.74
Standardized Values for Apartment Rents
-1.20
-0.93
-0.75
-0.47
-0.20
0.35
1.54
-1.11
-0.93
-0.75
-0.38
-0.11
0.44
1.54
-1.11
-0.93
-0.75
-0.38
-0.01
0.62
1.63
-1.02
-0.84
-0.75
-0.34
-0.01
0.62
1.81
-1.02
-0.84
-0.75
-0.29
-0.01
0.62
1.99
-1.02
-0.84
-0.56
-0.29
0.17
0.81
1.99
-1.02
-0.84
-0.56
-0.29
0.17
1.06
1.99
-1.02
-0.84
-0.56
-0.20
0.17
1.08
1.99
-0.93
-0.75
-0.47
-0.20
0.17
1.45
2.27
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
-0.93
-0.75
-0.47
-0.20
0.35
1.45
2.27
Slide 13
Chebyshev’s Theorem
At least (1 - 1/z2) of the items in any data set will be
within z standard deviations of the mean, where z is
any value greater than 1.
Chebyshev’s theorem requires z > 1, but z need not
be an integer.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 14
Chebyshev’s Theorem
At least 75% of the data values must be
within z = 2 standard deviations of the mean.
At least 89% of the data values must be
within z = 3 standard deviations of the mean.
At least 94% of the data values must be
within z = 4 standard deviations of the mean.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 15
Chebyshev’s Theorem
 Example: Apartment Rents
Let z = 1.5 with x = 490.80 and s = 54.74
At least (1  1/(1.5)2) = 1  0.44 = 0.56 or 56%
of the rent values must be between
x - z(s) = 490.80  1.5(54.74) = 409
and
x + z(s) = 490.80 + 1.5(54.74) = 573
(Actually, 86% of the rent values
are between 409 and 573.)
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 16
Empirical Rule
When the data are believed to approximate a
bell-shaped distribution …
The empirical rule can be used to determine the
percentage of data values that must be within a
specified number of standard deviations of the
mean.
The empirical rule is based on the normal
distribution, which is covered in Chapter 6.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 17
Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 18
Empirical Rule
99.72%
95.44%
68.26%
m – 3s
m – 1s
m – 2s
m
m + 3s
m + 1s
m + 2s
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
x
Slide 19
Detecting Outliers
 An outlier is an unusually small or unusually large
value in a data set.
 A data value with a z-score less than -3 or greater
than +3 might be considered an outlier.
 It might be:
• an incorrectly recorded data value
• a data value that was incorrectly included in the
data set
• a correctly recorded data value that belongs in
the data set
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 20
Detecting Outliers
 Example: Apartment Rents
• The most extreme z-scores are -1.20 and 2.27
• Using |z| > 3 as the criterion for an outlier, there
are no outliers in this data set.
Standardized Values for Apartment Rents
-1.20
-0.93
-0.75
-0.47
-0.20
0.35
1.54
-1.11
-0.93
-0.75
-0.38
-0.11
0.44
1.54
-1.11
-0.93
-0.75
-0.38
-0.01
0.62
1.63
-1.02
-0.84
-0.75
-0.34
-0.01
0.62
1.81
-1.02
-0.84
-0.75
-0.29
-0.01
0.62
1.99
-1.02
-0.84
-0.56
-0.29
0.17
0.81
1.99
-1.02
-0.84
-0.56
-0.29
0.17
1.06
1.99
-1.02
-0.84
-0.56
-0.20
0.17
1.08
1.99
-0.93
-0.75
-0.47
-0.20
0.17
1.45
2.27
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
-0.93
-0.75
-0.47
-0.20
0.35
1.45
2.27
Slide 21
Exploratory Data Analysis
Exploratory data analysis procedures enable us to use
simple arithmetic and easy-to-draw pictures to
summarize data.
We simply sort the data values into ascending order
and identify the five-number summary and then
construct a box plot.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 22
Five-Number Summary
1
Smallest Value
2
First Quartile
3
Median
4
Third Quartile
5
Largest Value
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 23
Five-Number Summary
 Example: Apartment Rents
First Quartile = 445
Lowest Value = 425
Median = 475
Third Quartile = 525 Largest Value = 615
425
440
450
465
480
510
575
430
440
450
470
485
515
575
430
440
450
470
490
525
580
435
445
450
472
490
525
590
435
445
450
475
490
525
600
435
445
460
475
500
535
600
435
445
460
475
500
549
600
435
445
460
480
500
550
600
440
450
465
480
500
570
615
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
440
450
465
480
510
570
615
Slide 24
Box Plot
A box plot is a graphical summary of data that is
based on a five-number summary.
A key to the development of a box plot is the
computation of the median and the quartiles Q1 and
Q3.
Box plots provide another way to identify outliers.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 25
Box Plot
 Example: Apartment Rents
• A box is drawn with its ends located at the first and
third quartiles.
• A vertical line is drawn in the box at the location of
the median (second quartile).
400 425 450 475 500 525 550 575 600 625
Q1 = 445
Q3 = 525
Q2 = 475
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 26
Box Plot
 Limits are located (not drawn) using the interquartile
range (IQR).
 Data outside these limits are considered outliers.
 The locations of each outlier is shown with the
symbol * .
continued
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 27
Box Plot
 Example: Apartment Rents
•
The lower limit is located 1.5(IQR) below Q1.
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
•
The upper limit is located 1.5(IQR) above Q3.
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
• There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 28
Box Plot
 Example: Apartment Rents
• Whiskers (dashed lines) are drawn from the ends
of the box to the smallest and largest data values
inside the limits.
400 425 450 475 500 525 550 575 600 625
Smallest value
inside limits = 425
Largest value
inside limits = 615
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 29
Measures of Association
Between Two Variables
Thus far we have examined numerical methods used
to summarize the data for one variable at a time.
Often a manager or decision maker is interested in
the relationship between two variables.
Two descriptive measures of the relationship
between two variables are covariance and correlation
coefficient.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 30
Covariance
The covariance is a measure of the linear association
between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 31
Covariance
The covariance is computed as follows:
sxy
 ( xi  x )( yi  y )

n 1
 ( xi  m x )( yi  m y )
s xy 
N
for
samples
for
populations
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 32
Correlation Coefficient
Correlation is a measure of linear association and not
necessarily causation.
Just because two variables are highly correlated, it
does not mean that one variable is the cause of the
other.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 33
Correlation Coefficient
The correlation coefficient is computed as follows:
rxy 
sxy
sx s y
for
samples
 xy
s xy

s xs y
for
populations
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 34
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
The closer the correlation is to zero, the weaker the
relationship.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 35
Covariance and Correlation Coefficient
 Example: Golfing Study
A golfer is interested in investigating the
relationship, if any, between driving distance and
18-hole score.
Average Driving
Average
Distance (yds.) 18-Hole Score
277.6
69
259.5
71
269.1
70
267.0
70
255.6
71
272.9
69
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 36
Covariance and Correlation Coefficient
 Example: Golfing Study
x
y
277.6
259.5
269.1
267.0
255.6
272.9
69
71
70
70
71
69
Average 267.0 70.0
Std. Dev. 8.2192 .8944
( xi  x ) ( yi  y ) ( xi  x )( yi  y )
10.65
-7.45
2.15
0.05
-11.35
5.95
-1.0
1.0
0
0
1.0
-1.0
-10.65
-7.45
0
0
-11.35
-5.95
Total -35.40
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 37
Covariance and Correlation Coefficient
 Example: Golfing Study
• Sample Covariance
sxy
( x  x )( y


i
n1
i
 y)
35.40

  7.08
61
• Sample Correlation Coefficient
sxy
7.08
rxy 

 -.9631
sx sy (8.2192)(.8944)
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 38
Using Excel to Compute the
Covariance and Correlation Coefficient
 Example: Golfing Study
• Excel Formula Worksheet
1
2
3
4
5
6
7
8
A
B
C
D
Average 18-Hole
Drive
Score
277.6
69
Pop. Covariance =COVARIANCE.S(A2:A7,B2:B7)
259.5
71 Samp. Correlation =CORREL(A2:A7,B2:B7)
269.1
70
267.0
70
255.6
71
272.9
69
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 39
Using Excel to Compute the
Covariance and Correlation Coefficient
 Example: Golfing Study
• Excel Value Worksheet
1
2
3
4
5
6
7
8
A
B
C
Average 18-Hole
Drive
Score
277.6
69
Pop. Covariance
259.5
71 Samp. Correlation
269.1
70
267.0
70
255.6
71
272.9
69
D
-5.9
-0.9631
Sample Covariance = sxy = n/(n – 1)sxy = 6/(6 – 1)(-5.9) = -7.08
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 40
The Weighted Mean and
Working with Grouped Data




Weighted Mean
Mean for Grouped Data
Variance for Grouped Data
Standard Deviation for Grouped Data
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 41
Weighted Mean
 When the mean is computed by giving each data
value a weight that reflects its importance, it is
referred to as a weighted mean.
 In the computation of a grade point average (GPA),
the weights are the number of credit hours earned for
each grade.
 When data values vary in importance, the analyst
must choose the weight that best reflects the
importance of each value.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 42
Weighted Mean
wx

x
w
i i
i
where:
xi = value of observation i
wi = weight for observation i
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 43
Grouped Data
 The weighted mean computation can be used to
obtain approximations of the mean, variance, and
standard deviation for the grouped data.
 To compute the weighted mean, we treat the
midpoint of each class as though it were the mean
of all items in the class.
 We compute a weighted mean of the class midpoints
using the class frequencies as weights.
 Similarly, in computing the variance and standard
deviation, the class frequencies are used as weights.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 44
Mean for Grouped Data
 Sample Data
fM

x
i
i
n
 Population Data
fM

m
i
i
N
where:
fi = frequency of class i
Mi = midpoint of class i
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 45
Sample Mean for Grouped Data
 Example: Apartment Rents
The previously presented sample of apartment
rents is shown here as grouped data in the form of
a frequency distribution.
Rent ($) Frequency
420-439
440-459
460-479
480-499
500-519
520-539
540-559
560-579
580-599
600-619
8
17
12
8
7
4
2
4
2
6
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 46
Sample Mean for Grouped Data
 Example: Apartment Rents
Rent ($)
420-439
440-459
460-479
480-499
500-519
520-539
540-559
560-579
580-599
600-619
Total
fi
8
17
12
8
7
4
2
4
2
6
70
Mi
429.5
449.5
469.5
489.5
509.5
529.5
549.5
569.5
589.5
609.5
f iM i
3436.0
7641.5
5634.0
3916.0
3566.5
2118.0
1099.0
2278.0
1179.0
3657.0
34525.0
34, 525
x
 493.21
70
This approximation
differs by $2.41 from
the actual sample
mean of $490.80.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 47
Variance for Grouped Data
 For sample data
2
 f i ( Mi  x )
s 
n 1
2
 For population data
2
f
(
M

m
)

i
i
s2 
N
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 48
Sample Variance for Grouped Data
 Example: Apartment Rents
Rent ($)
420-439
440-459
460-479
480-499
500-519
520-539
540-559
560-579
580-599
600-619
Total
fi
8
17
12
8
7
4
2
4
2
6
70
Mi
429.5
449.5
469.5
489.5
509.5
529.5
549.5
569.5
589.5
609.5
Mi - x
-63.7
-43.7
-23.7
-3.7
16.3
36.3
56.3
76.3
96.3
116.3
(M i - x )2 f i (M i - x )2
4058.96 32471.71
1910.56 32479.59
562.16
6745.97
13.76
110.11
265.36
1857.55
1316.96
5267.86
3168.56
6337.13
5820.16 23280.66
9271.76 18543.53
13523.36 81140.18
208234.29
continued
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 49
Sample Variance for Grouped Data
 Example: Apartment Rents
• Sample Variance
s2 = 208,234.29/(70 – 1) = 3,017.89
• Sample Standard Deviation
s  3,017.89  54.94
This approximation differs by only $.20
from the actual standard deviation of $54.74.
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 50
End of Chapter 3, Part B
© 2012 Cengage Learning. All Rights Reserved. May not be scanned, copied
or duplicated, or posted to a publicly accessible website, in whole or in part.
Slide 51