Download printable chapter 3 slides

Chapter 3
Numerical Descriptive Measures
Yandell – GSBA 502

Chapter Goals
After completing this chapter, you should be able to:
- Compute and interpret the mean, median, and mode for a set of data
- Find the range, variance, and standard deviation and know what these values mean
- Construct and interpret a box and whiskers plot
- Compute and explain the coefficient of variation
- Use numerical measures along with graphs, charts, and tables to describe data

Chapter Topics
- Measures of Center and Location: mean, median, mode, geometric mean, midrange
- Other Measures of Location: weighted mean, percentiles, quartiles
- Measures of Variation: range, interquartile range, variance and standard deviation, coefficient of variation
- Skewness (shape)
- Linear correlation

Summary Measures
Describing Data Numerically:
- Center and Location: Mean, Median, Mode, Weighted Mean
- Other Measures of Location: Percentiles, Quartiles
- Variation: Range, Interquartile Range, Variance, Standard Deviation, Coefficient of Variation
- Skewness

Notation Conventions
- Population Parameters are denoted with a letter from the Greek alphabet:
  - μ (mu) represents the population mean
  - σ (sigma) represents the population standard deviation
- Sample Statistics are commonly denoted with letters from the Roman alphabet:
  - $\bar{X}$ (X-bar) represents the sample mean
  - S represents the sample standard deviation

Measures of Center and Location
Overview
- Mean: $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n}$ (sample), $\mu = \frac{\sum_{i=1}^{N} X_i}{N}$ (population)
- Median: midpoint of ranked values
- Mode: most frequently observed value
- Weighted Mean: $\bar{X}_W = \frac{\sum w_i X_i}{\sum w_i}$ (sample), $\mu_W = \frac{\sum w_i X_i}{\sum w_i}$ (population)

Mean (Arithmetic Average)
- The Mean is the arithmetic average of data values
  - Sample mean (n = Sample Size):
    $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$
  - Population mean (N = Population Size):
    $\mu = \frac{\sum_{i=1}^{N} X_i}{N} = \frac{X_1 + X_2 + \cdots + X_N}{N}$

Mean (Arithmetic Average) (continued)
- The most common measure of central tendency
- Mean = sum of values divided by the number of values
- Affected by extreme values (outliers)
  - Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
  - Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
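
The effect of a single extreme value on the mean can be checked with a minimal Python sketch. The helper name `arithmetic_mean` is an illustrative assumption and does not come from the slides; the two data sets are the ones used in the slide example.

```python
def arithmetic_mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

without_outlier = [1, 2, 3, 4, 5]
with_outlier = [1, 2, 3, 4, 10]   # same data, but the largest value is extreme

mean_without = arithmetic_mean(without_outlier)   # 15 / 5
mean_with = arithmetic_mean(with_outlier)         # 20 / 5

print(mean_without, mean_with)
```

Changing one value out of five moves the mean from 3 to 4, which is the sensitivity to outliers the slide describes.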

Median
- In an ordered array, the median is the "middle" number (50% above, 50% below)
- Not affected by extreme values
  [Figure: two dot plots on an axis from 0 to 10, one with an extreme value; Median = 3 in both]

Median (continued)
- To find the median, rank the n values in order of magnitude
- Find the value in the (n+1)/2 position
  - If n or N is odd, the median is the middle number
  - If n or N is even, the median is the average (mean) of the two middle-most observations
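
The (n+1)/2 position rule can be sketched in Python as follows. The function name `median_by_position` is a hypothetical helper chosen for illustration, and the two small data sets are sample inputs for the odd and even cases.

```python
def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on ranked data."""
    ranked = sorted(values)
    n = len(ranked)
    position = (n + 1) / 2            # 1-based position in the ranked list
    lower = ranked[int(position) - 1]
    if position == int(position):     # n is odd: a single middle value
        return lower
    upper = ranked[int(position)]     # n is even: average the two middle values
    return (lower + upper) / 2

odd_case = median_by_position([1, 2, 3, 4, 10])   # position 3
even_case = median_by_position([2, 3, 4, 11])     # position 2.5

print(odd_case, even_case)
```

For the even case the position 2.5 falls between the 2nd and 3rd ranked values, so the result is their midpoint.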

Mode
- A measure of central tendency
- Value that occurs most often
- Not affected by extreme values
- Used for either numerical or categorical data
- There may be no mode
- There may be several modes
  [Figure: dot plot on an axis from 0 to 14 with Mode = 5; dot plot on an axis from 0 to 6 with No Mode]

Weighted Mean
- Used when values are grouped by frequency or relative importance
- Example: Sample of 26 Repair Projects

  Days to Complete   Frequency
  5                  4
  6                  12
  7                  8
  8                  2

  Weighted Mean Days to Complete:
  $\bar{X}_W = \frac{\sum w_i X_i}{\sum w_i} = \frac{(4 \times 5) + (12 \times 6) + (8 \times 7) + (2 \times 8)}{4 + 12 + 8 + 2} = \frac{164}{26} = 6.31$ days
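
As a quick check of the repair-project example, here is a minimal Python sketch of the weighted mean with frequencies as weights. The function name `weighted_mean` is an assumption made for illustration.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

days_to_complete = [5, 6, 7, 8]
frequency = [4, 12, 8, 2]          # number of projects at each duration

result = weighted_mean(days_to_complete, frequency)
print(round(result, 2))
```

The weights sum to the 26 projects in the sample, so the weighted mean equals the ordinary mean of all 26 individual durations.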

Review Example
- Five houses on a hill by the beach
  House Prices:
  $2,000,000
  500,000
  300,000
  100,000
  100,000
  [Figure: five houses on a hill, labeled $2,000 K, $500 K, $300 K, $100 K, $100 K]

Summary Statistics
House Prices: $2,000,000; 500,000; 300,000; 100,000; 100,000 (Sum $3,000,000)
- Mean: ($3,000,000/5) = $600,000
- Median: middle value of ranked data = $300,000
- Mode: most frequent value = $100,000
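
The three summary statistics for the house prices can be reproduced with Python's standard `statistics` module. This is a sketch for checking the arithmetic, not the tool used in the course; the variable names are illustrative.

```python
import statistics

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]

mean_price = statistics.mean(house_prices)      # sum / count
median_price = statistics.median(house_prices)  # middle value of ranked data
mode_price = statistics.mode(house_prices)      # most frequent value

print(mean_price, median_price, mode_price)
```

The single $2,000,000 house pulls the mean to twice the median, which is why the median is often the preferred summary for prices like these.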

Which measure of location is the "best"?
- Mean is generally used, unless extreme values (outliers) exist
- Then median is often used, since the median is not sensitive to extreme values
  - Example: Median home prices may be reported for a region, since the median is less sensitive to outliers

Note
- The mean and median values do not have to be values that are part of the data set.
- Example (four observations): 2, 3, 4, 11
  - Mean = (2+3+4+11)/4 = 5
  - Median = 3.5 (the median position is (N+1)/2 = 2.5th position, so use the midpoint of the middle-most values)

Shape of a Distribution
- Describes how data is distributed
- Symmetric or skewed
  - Left-Skewed: Mean < Median < Mode (longer tail extends to left)
  - Symmetric: Mean = Median = Mode
  - Right-Skewed: Mode < Median < Mean (longer tail extends to right)

Measuring Skewness
- A number called the coefficient of skewness (SK) is commonly used to measure skewness:
  $SK = \frac{3(\bar{X} - \text{Median})}{S}$
  where S is the sample standard deviation, $\bar{X}$ is the sample mean, and Median is the sample median.
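
To make the SK formula concrete, the sketch below applies it to the five house prices from the review example. The function name `coefficient_of_skewness` is a hypothetical helper; `statistics.stdev` is used because the formula calls for the sample standard deviation.

```python
import statistics

def coefficient_of_skewness(values):
    """SK = 3 * (sample mean - sample median) / sample standard deviation."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    s = statistics.stdev(values)      # sample standard deviation (divides by n - 1)
    return 3 * (mean - median) / s

house_prices = [2_000_000, 500_000, 300_000, 100_000, 100_000]
sk = coefficient_of_skewness(house_prices)
print(sk)
```

The mean ($600,000) exceeds the median ($300,000), so SK comes out positive, which matches the right-skewed shape of this data.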

Measuring Skewness (continued)
- The magnitude of SK indicates the degree of skewness, where $-3 \le SK \le +3$
  - SK < 0: skewed left
  - SK = 0: symmetric (not skewed)
  - SK > 0: skewed right
- The coefficient of skewness is calculated and reported by many computer statistical software packages.

Other Location Measures
Other Measures of Location:
- Percentiles: the pth percentile in a data array:
  - p% are less than or equal to this value
  - (100 - p)% are greater than or equal to this value (where 0 ≤ p ≤ 100)
- Quartiles:
  - 1st quartile = 25th percentile
  - 2nd quartile = 50th percentile = median
  - 3rd quartile = 75th percentile

Percentiles
- The pth percentile in an ordered array of n values is the value in ith position, where
  $i = \frac{p}{100}(n + 1)$
- Example: The 60th percentile in an ordered array of 19 values is the value in 12th position:
  $i = \frac{p}{100}(n + 1) = \frac{60}{100}(19 + 1) = 12$

Quartiles
- Quartiles split the ranked data into 4 equal groups (25% of the values in each), separated by Q1, Q2, and Q3
- Example: Find the first quartile
  Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22 (n = 9)
  Q1 = 25th percentile, so find the $\frac{25}{100}(9 + 1) = 2.5$ position,
  so use the value half way between the 2nd and 3rd values,
  so Q1 = 12.5

Quartile Formulas
- Find a quartile by determining the value in the xth position of the ranked data, where
  - First quartile: Q1 = (n+1)/4
  - Second quartile: Q2 = (n+1)/2 (the median position)
  - Third quartile: Q3 = 3(n+1)/4
  where n is the number of observed values

Box and Whisker Plot
- A graphical display of data using the 5-number summary:
  Minimum -- Q1 -- Median -- Q3 -- Maximum
- Example: [Figure: box-and-whisker plot marking Minimum, 1st Quartile, Median, 3rd Quartile, and Maximum, with 25% of the values in each of the four segments]
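
The percentile position rule i = (p/100)(n + 1) and the quartile example can be sketched in Python. The slides only state what to do when the position ends in .5 (take the value half way between neighbors); the linear interpolation used here for other fractional positions, the clamping at the ends, and the function name `percentile_value` are assumptions made for illustration.

```python
def percentile_value(ranked, p):
    """Value at 1-based position i = p * (n + 1) / 100 in ranked data."""
    n = len(ranked)
    i = p * (n + 1) / 100
    whole = int(i)                  # whole part of the position
    fraction = i - whole            # how far toward the next ranked value
    if whole < 1:                   # position falls before the first value
        return ranked[0]
    if whole >= n:                  # position falls at or after the last value
        return ranked[-1]
    lower = ranked[whole - 1]
    upper = ranked[whole]
    return lower + fraction * (upper - lower)

sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]   # ordered array, n = 9

q1 = percentile_value(sample, 25)   # position 2.5
q2 = percentile_value(sample, 50)   # position 5, the median
q3 = percentile_value(sample, 75)   # position 7.5

print(q1, q2, q3)
```

Statistical packages use several different quartile conventions, so results from other software may differ slightly from this (n + 1) rule.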

Shape of Box and Whisker Plots
- The Box and central line are centered between the endpoints if data is symmetric around the median
- A Box and Whisker plot can be shown in either vertical or horizontal format

Distribution Shape and Box and Whisker Plot
[Figure: three box-and-whisker plots, each marking Q1, Q2, and Q3, for Left-Skewed, Symmetric, and Right-Skewed distributions]

Box-and-Whisker Plot Example
- Below is a Box-and-Whisker plot for the following data:
  0 2 2 2 3 3 4 5 5 10 27
  Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
  [Figure: box-and-whisker plot with axis marks at 0, 2, 3, 5, and 27]
- This data is very right skewed, as the plot depicts

Using PHStat to construct a Box-and-Whisker Plot
- The PHStat add-in can be used to easily create a Box-and-Whisker Plot.
- If you have several data sets and wish to make comparisons, PHStat can create multiple plots in the same display window

Measures of Variation
Variation:
- Range
- Interquartile Range
- Variance: Population Variance, Sample Variance
- Standard Deviation: Population Standard Deviation, Sample Standard Deviation
- Coefficient of Variation

Variation
- Measures of variation give information on the spread or variability of the data values.
  [Figure: two distributions with the same center, different variation]

Range
- Simplest measure of variation
- Difference between the largest and the smallest observations:
  Range = $x_{maximum} - x_{minimum}$
- Example: [Figure: dot plot on an axis from 0 to 14] Range = 14 - 1 = 13

Disadvantages of the Range
- Ignores the way in which data are distributed
  [Figure: two differently distributed dot plots on an axis from 7 to 12, each with Range = 12 - 7 = 5]
- Sensitive to outliers
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
  Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
  Range = 120 - 1 = 119
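
The sensitivity of the range to one outlier can be verified with a short Python sketch built from the two 25-value lists in the slide. The function name `value_range` is an illustrative assumption.

```python
def value_range(values):
    """Difference between the largest and the smallest observations."""
    return max(values) - min(values)

# Eleven 1s, eight 2s, four 3s, then the last two values
typical = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

range_typical = value_range(typical)
range_outlier = value_range(with_outlier)

print(range_typical, range_outlier)
```

Only one of the 25 observations differs between the two lists, yet the range changes from 4 to 119.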

Interquartile Range
- Can eliminate some outlier problems by using the interquartile range
- Eliminate some high- and low-valued observations and calculate the range from the remaining values
- Interquartile range = 3rd quartile - 1st quartile

Interquartile Range
- Example:
  X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70 (25% of the values between each adjacent pair)
  Interquartile range = 57 - 30 = 27

Variance
- Average of squared deviations of values from the mean
  - Sample variance: $S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
  - Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$

Standard Deviation
- Most commonly used measure of variation
- Shows variation about the mean
- Has the same units as the original data
  - Sample standard deviation: $S = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}}$
  - Population standard deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}}$
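
The difference between the sample and population formulas is only the divisor (n - 1 versus N), which the following Python sketch makes explicit. The function names and the eight data values are illustrative assumptions, not taken from the slides.

```python
import math

def sample_variance(values):
    """S^2: squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)

def population_variance(values):
    """sigma^2: squared deviations from the population mean, divided by N."""
    big_n = len(values)
    mu = sum(values) / big_n
    return sum((x - mu) ** 2 for x in values) / big_n

data = [2, 4, 4, 4, 5, 5, 7, 9]    # illustrative values, not from the slides

s_squared = sample_variance(data)
sigma_squared = population_variance(data)
s = math.sqrt(s_squared)            # sample standard deviation
sigma = math.sqrt(sigma_squared)    # population standard deviation

print(s_squared, sigma_squared, s, sigma)
```

Because the sample formula divides by a smaller number, the sample variance is always slightly larger than the population variance computed from the same values.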

Calculation Example: Sample Standard Deviation
Sample Data ($X_i$): 10 12 14 15 17 18 18 24
n = 8, Mean = $\bar{X}$ = 16

$s = \sqrt{\frac{(10 - \bar{X})^2 + (12 - \bar{X})^2 + (14 - \bar{X})^2 + \cdots + (24 - \bar{X})^2}{n - 1}}$
$= \sqrt{\frac{(10 - 16)^2 + (12 - 16)^2 + (14 - 16)^2 + \cdots + (24 - 16)^2}{8 - 1}}$
$= \sqrt{\frac{130}{7}} = 4.3095$

(The squared deviations of the eight values are 36, 16, 4, 1, 1, 4, 4, and 64, which sum to 130.)

Measuring variation
[Figure: two distributions, one with a small standard deviation and one with a large standard deviation]
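
The worked example can be recomputed step by step in Python. This is a sketch for checking the arithmetic: recomputing the sum of squared deviations directly from the eight listed values gives 130, and the result is cross-checked against `statistics.stdev` from the standard library.

```python
import math
import statistics

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
mean = sum(sample) / n
squared_deviations = [(x - mean) ** 2 for x in sample]
sum_of_squares = sum(squared_deviations)

# Sample standard deviation: divide by n - 1, then take the square root
s = math.sqrt(sum_of_squares / (n - 1))

print(mean, sum_of_squares, round(s, 4))
print(statistics.stdev(sample))     # library cross-check of the same quantity
```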

Comparing Standard Deviations
[Figure: three dot plots on an axis from 11 to 21]
- Data A: Mean = 15.5, s = 3.338
- Data B: Mean = 15.5, s = .9258
- Data C: Mean = 15.5, s = 4.57

Advantages of Variance and Standard Deviation
- Each value in the data set is used in the calculation
- Values far from the mean are given extra weight (because deviations from the mean are squared)

The Empirical Rule
- If the data distribution is bell-shaped, then the interval:
  - $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
  [Figure: bell-shaped curve with 68% of the area within $\mu \pm 1\sigma$]

The Empirical Rule
- $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
- $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample
  [Figure: bell-shaped curves with 95% of the area within $\mu \pm 2\sigma$ and 99.7% within $\mu \pm 3\sigma$]
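
The empirical rule can be illustrated by simulation. The sketch below draws bell-shaped data with `random.gauss` and counts the share of values within 1, 2, and 3 standard deviations of the mean. The mean of 50, standard deviation of 10, sample size, and seed are arbitrary assumptions chosen for the illustration.

```python
import math
import random

random.seed(502)                    # arbitrary seed so the run is repeatable
values = [random.gauss(50, 10) for _ in range(100_000)]

mean = sum(values) / len(values)
sd = math.sqrt(sum((x - mean) ** 2 for x in values) / len(values))

def share_within(k):
    """Fraction of the values lying within k standard deviations of the mean."""
    low, high = mean - k * sd, mean + k * sd
    return sum(low <= x <= high for x in values) / len(values)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd: {share:.3f}")
```

The simulated shares should land close to 68%, 95%, and 99.7%, with small differences from run to run if the seed is changed. The rule is only a guide for data that is not bell-shaped.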

Coefficient of Variation
- Measures relative variation
- Always in percentage (%)
- Shows variation relative to mean
- Is used to compare two or more sets of data measured in different units
  - Population: $CV = \left(\frac{\sigma}{\mu}\right) \cdot 100\%$
  - Sample: $CV = \left(\frac{S}{\bar{X}}\right) \cdot 100\%$

Comparing Coefficient of Variation
- Stock A:
  - Average price last year = $50
  - Standard deviation = $5
  $CV_A = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$50} \cdot 100\% = 10\%$
- Stock B:
  - Average price last year = $100
  - Standard deviation = $5
  $CV_B = \left(\frac{S}{\bar{X}}\right) \cdot 100\% = \frac{\$5}{\$100} \cdot 100\% = 5\%$
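
The stock comparison reduces to one division per stock, sketched here in Python. The function name `coefficient_of_variation` is an illustrative assumption.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return 100 * std_dev / mean

cv_a = coefficient_of_variation(std_dev=5, mean=50)    # Stock A
cv_b = coefficient_of_variation(std_dev=5, mean=100)   # Stock B

print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```

Because the dollar units cancel in the ratio, the same function can compare data sets measured in different units.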

Both stocks have the same standard deviation, but stock B is less variable relative to its price

Using Microsoft Excel
- Descriptive Statistics are easy to obtain from Microsoft Excel
- Use menu choice: tools / data analysis / descriptive statistics
- Enter details in dialog box

Using Excel
- Use menu choice: tools / data analysis / descriptive statistics
- Open the house price worksheet, then follow the steps shown below to obtain descriptive statistics

Using Excel (continued)
- Enter dialog box details
- Check box for summary statistics
- Click OK

Excel output
Microsoft Excel descriptive statistics output, using the house price data:
House Prices: $2,000,000; 500,000; 300,000; 100,000; 100,000

Scatter Plots and Correlation
- A scatter plot (or scatter diagram) is used to show the relationship between two variables
- Correlation analysis is used to measure strength of the linear association (linear relationship) between two variables
  - Only concerned with strength of the relationship
  - No causal effect is implied

Scatter Plot Examples
[Figure: scatter plots of Y against X showing strong relationships and weak relationships]

Scatter Plot Examples (continued)
[Figure: scatter plots of Y against X showing no relationship]

Correlation Coefficient (continued)
- The population correlation coefficient ρ (rho) measures the strength of the association between the variables
- The sample correlation coefficient r is an estimate of ρ and is used to measure the strength of the linear relationship in the sample observations

Features of ρ and r
- Unit free
- Range between -1 and 1
- The closer to -1, the stronger the negative linear relationship
- The closer to 1, the stronger the positive linear relationship
- The closer to 0, the weaker the linear relationship

Examples of Approximate r Values
[Figure: five scatter plots of Y against X with r = -1, r = -.6, r = 0, r = +.3, and r = +1]

Calculating the Correlation Coefficient
Sample correlation coefficient:
$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{[\sum (X - \bar{X})^2][\sum (Y - \bar{Y})^2]}}$
or the algebraic equivalent:
$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n(\sum X^2) - (\sum X)^2][n(\sum Y^2) - (\sum Y)^2]}}$
where:
r = Sample correlation coefficient
n = Sample size
X = Value of the independent variable
Y = Value of the dependent variable

Calculation Example

  Tree Height   Trunk Diameter
  Y             X                XY       Y^2       X^2
  35            8                280      1225      64
  49            9                441      2401      81
  27            7                189      729       49
  33            6                198      1089      36
  60            13               780      3600      169
  21            7                147      441       49
  45            11               495      2025      121
  51            12               612      2601      144
  Σ=321         Σ=73             Σ=3142   Σ=14111   Σ=713

Calculation Example (continued)
$r = \frac{n \sum XY - \sum X \sum Y}{\sqrt{[n(\sum X^2) - (\sum X)^2][n(\sum Y^2) - (\sum Y)^2]}} = \frac{8(3142) - (73)(321)}{\sqrt{[8(713) - (73)^2][8(14111) - (321)^2]}} = 0.886$
[Figure: scatter plot of Tree Height, Y (axis 0 to 70) against Trunk Diameter, X (axis 0 to 14)]
r = 0.886 → relatively strong positive linear association between X and Y

Excel Output
Excel Correlation Output (Tools / data analysis / correlation…):

                   Tree Height   Trunk Diameter
  Tree Height      1
  Trunk Diameter   0.886231      1

Correlation between Tree Height and Trunk Diameter
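
The algebraic form of the correlation formula can be checked against the tree data with a short Python sketch. The function name `sample_correlation` is a hypothetical helper; the data are the eight (X, Y) pairs from the calculation example.

```python
import math

tree_height = [35, 49, 27, 33, 60, 21, 45, 51]     # Y
trunk_diameter = [8, 9, 7, 6, 13, 7, 11, 12]       # X

def sample_correlation(x, y):
    """r from the algebraic form using the sums of X, Y, XY, X^2, and Y^2."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator

r = sample_correlation(trunk_diameter, tree_height)
print(round(r, 6))
```

Swapping the two arguments gives the same r, since the formula is symmetric in X and Y.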

Chapter Summary
- Described measures of center and location: mean, median, mode
- Discussed percentiles and quartiles
- Described measures of variation: range, interquartile range, variance, standard deviation, coefficient of variation
- Created Box-and-Whisker plots
- Illustrated distribution shapes (symmetric, skewed)
- Discussed linear correlation