RANDOM VARIABLES,
EXPECTATIONS,
VARIANCES ETC.
1
Variable
• Recall:
• Variable: A characteristic of a population or
sample that is of interest to us.
• Random variable: A function defined on the
sample space S that associates a real number
with each outcome in S.
2
DISCRETE RANDOM VARIABLES
• If the set of all possible values of a r.v. X is a
countable set, then X is called a discrete r.v.
• The function f(x)=P(X=x) for x=x1,x2, … that
assigns a probability to each value x is called the
probability mass function (p.m.f.), also referred to as the
probability density function (p.d.f.) of a discrete r.v.
3
Example
• Discrete Uniform distribution:
$P(X = x) = \frac{1}{N}; \quad x = 1, 2, \ldots, N; \quad N = 1, 2, \ldots$
• Example: throw a fair die.
P(X=1)=…=P(X=6)=1/6
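The die example can be checked with a few lines of Python. This is only a minimal sketch; the function name `discrete_uniform_pmf` is an illustrative choice and does not come from the slides.

```python
from fractions import Fraction


def discrete_uniform_pmf(x, n):
    """Return P(X = x) for X uniform on {1, ..., n} (hypothetical helper)."""
    if isinstance(x, int) and 1 <= x <= n:
        return Fraction(1, n)
    return Fraction(0)


# Fair die: N = 6, so every face has probability 1/6.
die = [discrete_uniform_pmf(x, 6) for x in range(1, 7)]
total = sum(die)  # a valid pmf must sum to 1
```

Exact fractions are used so that the "sums to 1" property holds without rounding error.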
4
CONTINUOUS RANDOM VARIABLES
• A continuous r.v. arises when the sample space is uncountable
(continuous).
• Example: Continuous Uniform(a,b)
$f(x) = \frac{1}{b-a}, \quad a \le x \le b.$
5
CUMULATIVE DISTRIBUTION FUNCTION
(C.D.F.)
• CDF of a r.v. X is defined as F(x)=P(X≤x).
• Note that, P(a<X ≤b)=F(b)-F(a).
• A function F(x) is a CDF for some r.v. X iff it
satisfies
– $\lim_{x \to -\infty} F(x) = 0$
– $\lim_{x \to \infty} F(x) = 1$
– $\lim_{h \to 0^+} F(x+h) = F(x)$, i.e., F(x) is continuous from the right
– $a < b$ implies $F(a) \le F(b)$, i.e., F(x) is non-decreasing.
6
Example
• Consider tossing three fair coins.
• Let X=number of heads observed.
• S={TTT, TTH, THT, HTT, THH, HTH, HHT, HHH}
• P(X=0)=P(X=3)=1/8; P(X=1)=P(X=2)=3/8

x          F(x)
(-∞,0)     0
[0,1)      1/8
[1,2)      1/2
[2,3)      7/8
[3,∞)      1
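The step-function CDF in this example can be reproduced by enumerating the sample space. The sketch below assumes equally likely outcomes (fair coins); the names `pmf` and `cdf` are illustrative.

```python
from fractions import Fraction
from itertools import product

# Enumerate the 8 equally likely outcomes of three fair coin tosses.
outcomes = list(product("HT", repeat=3))

# pmf of X = number of heads.
pmf = {}
for outcome in outcomes:
    heads = outcome.count("H")
    pmf[heads] = pmf.get(heads, Fraction(0)) + Fraction(1, len(outcomes))


def cdf(x):
    """F(x) = P(X <= x): add up the pmf over all values not exceeding x."""
    return sum((p for value, p in pmf.items() if value <= x), Fraction(0))
```

Evaluating `cdf` at any point inside one of the intervals listed above returns the same constant, which is what makes F a right-continuous step function.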
7
Example
3
f
(
x
)

2
(
1

x
)
for x  0
• Let
 x 2(1  t ) 3 dt  1  (1  x )  2 for x  0

F( x )  P(X  x )   0
0
for x  0
P(0.4  X  0.45)  
0.45
0.4
f ( x )dx  F(0.45)  F(0.4)  0.035
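The probability in this example can be verified two ways: from the closed-form CDF and by integrating the pdf numerically. This is a sketch only; the midpoint rule and the helper name `midpoint_integral` are assumptions made for illustration.

```python
def pdf(x):
    """f(x) = 2(1+x)^(-3) for x > 0, and 0 otherwise."""
    return 2.0 * (1.0 + x) ** -3 if x > 0 else 0.0


def cdf(x):
    """F(x) = 1 - (1+x)^(-2) for x > 0, and 0 otherwise."""
    return 1.0 - (1.0 + x) ** -2 if x > 0 else 0.0


def midpoint_integral(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width


prob_exact = cdf(0.45) - cdf(0.4)
prob_numeric = midpoint_integral(pdf, 0.4, 0.45)
```

Both values round to 0.035, the figure quoted in the example.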
8
JOINT DISTRIBUTIONS
• In many applications there is more than one
random variable of interest, say X1, X2,…,Xk.
JOINT DISCRETE DISTRIBUTIONS
• The joint probability mass function (joint pmf)
of the k-dimensional discrete rv
X=(X1, X2,…,Xk) is
$f(x_1, x_2, \ldots, x_k) = P(X_1 = x_1, X_2 = x_2, \ldots, X_k = x_k)$
for all possible values $(x_1, x_2, \ldots, x_k)$ of X.
9
JOINT DISCRETE DISTRIBUTIONS
• A function f(x1, x2,…, xk) is the joint pmf for
some vector valued rv X=(X1, X2,…,Xk) iff the
following properties are satisfied:
$f(x_1, x_2, \ldots, x_k) \ge 0$ for all $(x_1, x_2, \ldots, x_k)$
and
$\sum_{x_1} \cdots \sum_{x_k} f(x_1, x_2, \ldots, x_k) = 1.$
10
Example
• Tossing two fair dice → 36 possible sample
points
• Let X: sum of the two dice;
Y: |difference of the two dice|
• For example:
– For (3,3), X=6 and Y=0.
– For both (4,1) and (1,4), X=5, Y=3.
11
Example
• Joint pmf f(x,y) of (X,Y)

y\x   2     3     4     5     6     7     8     9     10    11    12
0     1/36        1/36        1/36        1/36        1/36        1/36
1           1/18        1/18        1/18        1/18        1/18
2                 1/18        1/18        1/18        1/18
3                       1/18        1/18        1/18
4                             1/18        1/18
5                                   1/18

Empty cells are equal to 0.
e.g. P(X=7,Y≤4)=f(7,0)+f(7,1)+f(7,2)+f(7,3)+f(7,4)=0+1/18+0+1/18+0=1/9
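The joint pmf of the two-dice example can be built by enumerating all 36 sample points. The following is a minimal sketch; the variable names are illustrative.

```python
from collections import defaultdict
from fractions import Fraction

# joint[(x, y)] = P(X = x, Y = y), X = sum, Y = |difference| of two fair dice.
joint = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        joint[(d1 + d2, abs(d1 - d2))] += Fraction(1, 36)

# P(X = 7, Y <= 4): add the joint pmf over the cells that make up the event.
p_x7_y_le_4 = sum(p for (x, y), p in joint.items() if x == 7 and y <= 4)
```

Cells that never occur, such as (7, 0), are simply absent from the dictionary, which corresponds to the empty (zero) cells of the table.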
12
MARGINAL DISCRETE
DISTRIBUTIONS
• If the pair (X1,X2) of discrete random variables
has the joint pmf f(x1,x2), then the marginal
pmfs of X1 and X2 are
$f_1(x_1) = \sum_{x_2} f(x_1, x_2)$ and $f_2(x_2) = \sum_{x_1} f(x_1, x_2)$
13
Example
• In the previous example,
– $P(X = 2) = \sum_{y=0}^{5} P(X = 2, y) = P(X = 2, y = 0) + \ldots + P(X = 2, y = 5) = 1/36$
– $P(Y = 2) = \sum_{x=2}^{12} P(x, Y = 2) = 4/18$
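The two marginal probabilities above come from summing the joint pmf over the other variable. A short sketch for the two-dice example (all names are illustrative):

```python
from collections import defaultdict
from fractions import Fraction

# Rebuild the joint pmf of X = sum and Y = |difference| of two fair dice.
joint = defaultdict(Fraction)
for d1 in range(1, 7):
    for d2 in range(1, 7):
        joint[(d1 + d2, abs(d1 - d2))] += Fraction(1, 36)

# Marginals: sum the joint pmf over the variable that is not of interest.
marginal_x = defaultdict(Fraction)
marginal_y = defaultdict(Fraction)
for (x, y), p in joint.items():
    marginal_x[x] += p
    marginal_y[y] += p
```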
14
JOINT DISCRETE DISTRIBUTIONS
• JOINT CDF:
$F(x_1, x_2, \ldots, x_k) = P(X_1 \le x_1, \ldots, X_k \le x_k).$
• F(x1,x2) is a cdf iff
– $\lim_{x_1 \to -\infty} F(x_1, x_2) = F(-\infty, x_2) = 0, \ \forall x_2.$
– $\lim_{x_2 \to -\infty} F(x_1, x_2) = F(x_1, -\infty) = 0, \ \forall x_1.$
– $\lim_{x_1 \to \infty,\, x_2 \to \infty} F(x_1, x_2) = F(\infty, \infty) = 1.$
– $P(a < X_1 \le b, c < X_2 \le d) = F(b,d) - F(b,c) - F(a,d) + F(a,c) \ge 0, \ \forall a < b$ and $c < d.$
– $\lim_{h \to 0^+} F(x_1 + h, x_2) = \lim_{h \to 0^+} F(x_1, x_2 + h) = F(x_1, x_2), \ \forall x_1$ and $x_2.$
15
JOINT CONTINUOUS DISTRIBUTIONS
• A k-dimensional vector valued rv X=(X1,
X2,…,Xk) is said to be continuous if there is a
function f(x1, x2,…, xk), called the joint
probability density function (joint pdf), of X,
such that the joint cdf can be given as
$$F(x_1, x_2, \ldots, x_k) = \int_{-\infty}^{x_k} \cdots \int_{-\infty}^{x_2} \int_{-\infty}^{x_1} f(t_1, t_2, \ldots, t_k)\,dt_1\,dt_2 \ldots dt_k$$
16
JOINT CONTINUOUS DISTRIBUTIONS
• A function f(x1, x2,…, xk) is the joint pdf for
some vector valued rv X=(X1, X2,…,Xk) iff the
following properties are satisfied:
$f(x_1, x_2, \ldots, x_k) \ge 0$ for all $(x_1, x_2, \ldots, x_k)$
and
$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(x_1, x_2, \ldots, x_k)\,dx_1\,dx_2 \ldots dx_k = 1.$$
17
JOINT CONTINUOUS DISTRIBUTIONS
• If the pair (X1,X2) of continuous random variables
has the joint pdf f(x1,x2), then the marginal
pdfs of X1 and X2 are
$f_1(x_1) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_2$ and $f_2(x_2) = \int_{-\infty}^{\infty} f(x_1, x_2)\,dx_1.$
18
JOINT DISTRIBUTIONS
• If X1, X2,…,Xk are independent of each
other, then the joint pdf can be given as
$f(x_1, x_2, \ldots, x_k) = f_1(x_1)\, f_2(x_2) \cdots f_k(x_k)$
and the joint cdf can be written as
$F(x_1, x_2, \ldots, x_k) = F_1(x_1)\, F_2(x_2) \cdots F_k(x_k)$
19
CONDITIONAL DISTRIBUTIONS
• If X1 and X2 are discrete or continuous random
variables with joint pdf f(x1,x2), then the
conditional pdf of X2 given X1=x1 is defined by
$f(x_2 \mid x_1) = \frac{f(x_1, x_2)}{f_1(x_1)}$, for $x_1$ such that $f_1(x_1) > 0$, and 0 elsewhere.
• For independent rvs,
$f(x_2 \mid x_1) = f_2(x_2)$
$f(x_1 \mid x_2) = f_1(x_1)$
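To make the definition concrete, the sketch below computes a conditional pmf for the earlier two-dice example (X = sum, Y = |difference|) and checks whether the factorization required for independence holds. The helper names are assumptions made for illustration.

```python
from fractions import Fraction

# Joint pmf of X = sum and Y = |difference| of two fair dice.
joint = {}
for d1 in range(1, 7):
    for d2 in range(1, 7):
        key = (d1 + d2, abs(d1 - d2))
        joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)


def marginal_x(x):
    return sum(p for (xx, _), p in joint.items() if xx == x)


def marginal_y(y):
    return sum(p for (_, yy), p in joint.items() if yy == y)


def conditional_y_given_x(y, x):
    """f(y | x) = f(x, y) / f_X(x) where f_X(x) > 0, and 0 elsewhere."""
    fx = marginal_x(x)
    if fx == 0:
        return Fraction(0)
    return joint.get((x, y), Fraction(0)) / fx


# Given X = 7, the only possible differences are 1, 3 and 5, each with 1/3.
given_seven = {y: conditional_y_given_x(y, 7) for y in range(6)}

# X and Y are not independent: f(7, 1) differs from f_X(7) * f_Y(1).
factorizes = joint[(7, 1)] == marginal_x(7) * marginal_y(1)
```

Because the conditional pmf of Y changes with the value of X, the sum and the absolute difference of two dice are dependent.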
20
Example
Statistical Analysis of Employment Discrimination Data (Example
from Dudewicz & Mishra, 1988; data from Dawson, Hankey
and Myers, 1982)
% promoted (number of employees)

Pay grade   Affected class   Others
5           100 (6)          84 (80)
7           88 (8)           87 (195)
9           93 (29)          88 (335)
10          7 (102)          8 (695)
11          7 (15)           11 (185)
12          10 (10)          7 (165)
13          0 (2)            9 (81)
14          0 (1)            7 (41)

The affected class might be a minority group or, e.g., women.
21
Example, cont.
• Does this data indicate discrimination against the
affected class in promotions in this company?
• Let X=(X1,X2,X3) where X1 is pay grade of an
employee; X2 is an indicator of whether the
employee is in the affected class or not; X3 is an
indicator of whether the employee was promoted or
not
• x1 ∈ {5,7,9,10,11,12,13,14}; x2 ∈ {0,1}; x3 ∈ {0,1}
22
Example, cont.
Pay grade   Affected class   Others
10          7 (102)          8 (695)
• E.g., in pay grade 10 of this occupation (X1=10) there
were 102 members of the affected class and 695
members of the other classes. Seven percent of the
affected class in pay grade 10 had been promoted,
that is (102)(0.07)=7 individuals out of 102 had been
promoted.
• Out of 1950 employees, only 173 are in the affected
class; this is not atypical in such studies.
23
Example, cont.
Pay grade   Affected class   Others
10          7 (102)          8 (695)
• E.g. probability of a randomly selected employee
being in pay grade 10, being in the affected class, and
promoted: P(X1=10,X2=1,X3=1)=7/1950=0.0036
(Probability function of a discrete 3 dimensional r.v.)
• E.g. probability of a randomly selected employee
being in pay grade 10 and promoted:
P(X1=10, X3=1)=(7+56)/1950=0.0323 (Note: 8% of 695 ≈ 56) (marginal probability function of X1 and X3)
24
Example, cont.
• E.g. probability that an employee is in the other class
(X2=0) given that the employee is in pay grade 10
(X1=10) and was promoted (X3=1):
P(X2=0| X1=10, X3=1)= P(X1=10,X2=0,X3=1)/P(X1=10, X3=1)
=(56/1950)/(63/1950)=0.89 (conditional probability)
• probability that an employee is in the affected class
(X2=1) given that the employee is in pay grade 10
(X1=10) and was promoted (X3=1):
P(X2=1| X1=10, X3=1)=(7/1950)/(63/1950)=0.11
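The joint, marginal, and conditional probabilities for pay grade 10 can be reproduced from the counts used in the example (7 promoted in the affected class, 56 promoted among the others, 1950 employees in total). This is a sketch; the dictionary layout is an illustrative choice.

```python
from fractions import Fraction

TOTAL_EMPLOYEES = 1950

# Joint probabilities P(X1 = 10, X2 = x2, X3 = 1), keyed by (x1, x2, x3).
joint = {
    (10, 1, 1): Fraction(7, TOTAL_EMPLOYEES),   # affected class, promoted
    (10, 0, 1): Fraction(56, TOTAL_EMPLOYEES),  # others, promoted
}

# Marginal P(X1 = 10, X3 = 1): sum over the class indicator X2.
p_grade10_promoted = sum(joint.values())

# Conditional P(X2 = x2 | X1 = 10, X3 = 1) = joint / marginal.
conditional = {x2: p / p_grade10_promoted for (_, x2, _), p in joint.items()}
```

The conditional probabilities come out to 8/9 and 1/9, which round to the 0.89 and 0.11 reported above.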
25
Production problem
• Two companies manufacture a certain type of sophisticated
electronic equipment for the government; to avoid lawsuits,
let's call them company C and company D. In the past, company C has had
5% good output, whereas D had 50% good output (i.e., 95% of C’s
output and 50% of D’s output is not of acceptable quality). The
government has just ordered 10,100 of these devices from
company D and 11,000 from C (maybe for political reasons, maybe
company D does not have a large enough capacity for more
orders). Before the production of these devices starts, government
scientists develop a new manufacturing method that they believe
will almost double the % of good devices received. Companies C
and D are given this info, but its use is optional: they must each
use this new method for at least 100 of their devices, but its use
beyond that point is left to their discretion.
Production problem, cont.
• When the devices are received and tested, the
following table is observed:
            Production method
Results     Standard       New
Bad         5950           9005
Good        5050 (46%)     1095 (11%)
• Officials blame scientists and companies for
producing with the lousy new method which is
clearly inferior.
• Scientists still claim that the new method has almost
doubled the % of good items.
• Which one is right?
Production problem, cont.
• Answer: the scientists rule!
            Company C                  Company D
Results     Standard     New           Standard       New
Bad         950          9000          5000           5
Good        50 (5%)      1000 (10%)    5000 (50%)     95 (95%)
• The new method nearly doubled the % of good
items for both companies.
• Company D knew that its production under the
standard method was already good, so it used
the new method only for the minimum allowed.
• This is called Simpson’s paradox. Do not combine
the results for the 2 companies in such cases.
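The reversal can be reproduced directly from the counts in the two tables. The sketch below stores (good, bad) counts per company and method; the data layout and names are illustrative.

```python
# (good, bad) device counts per (company, method), taken from the tables.
counts = {
    ("C", "standard"): (50, 950),
    ("C", "new"): (1000, 9000),
    ("D", "standard"): (5000, 5000),
    ("D", "new"): (95, 5),
}


def good_rate(good, bad):
    """Fraction of good devices among all devices produced."""
    return good / (good + bad)


# Within each company, the new method roughly doubles the good rate.
per_company = {key: good_rate(*value) for key, value in counts.items()}

# Pooling the two companies reverses the comparison (Simpson's paradox).
pooled = {}
for method in ("standard", "new"):
    good = sum(g for (_, m), (g, _) in counts.items() if m == method)
    bad = sum(b for (_, m), (_, b) in counts.items() if m == method)
    pooled[method] = good_rate(good, bad)
```

The pooled comparison is misleading because company D, with the higher baseline quality, produced almost all of its devices with the standard method.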
Describing the Population
• We’re interested in describing the population by
computing various parameters.
• For instance, we calculate the population mean
and population variance.
29
EXPECTED VALUES
Let X be a rv with pdf $f_X(x)$ and g(X) be a
function of X. Then, the expected value (or
the mean or the mathematical expectation) of
g(X) is
$$E[g(X)] = \begin{cases} \sum_x g(x) f_X(x), & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} g(x) f_X(x)\,dx, & \text{if } X \text{ is continuous} \end{cases}$$
provided the sum or the integral exists, i.e.,
$-\infty < E[g(X)] < \infty$.
30
EXPECTED VALUES
• E[g(X)] is finite if E[| g(X) |] is finite.
$$E[|g(X)|] = \begin{cases} \sum_x |g(x)| f_X(x) < \infty, & \text{if } X \text{ is discrete} \\ \int_{-\infty}^{\infty} |g(x)| f_X(x)\,dx < \infty, & \text{if } X \text{ is continuous} \end{cases}$$
31
Population Mean (Expected Value)
• Given a discrete random variable X with
values xi, that occur with probabilities p(xi),
the population mean of X is
$E(X) = \mu = \sum_{\text{all } x_i} x_i\, p(x_i)$
32
Population Variance
– Let X be a discrete random variable with
possible values xi that occur with
probabilities p(xi), and let E(X) = μ. The
variance of X is defined by
$V(X) = \sigma^2 = E[(X - \mu)^2] = \sum_{\text{all } x_i} (x_i - \mu)^2 p(x_i)$
(measured in squared units, unit × unit).
The standard deviation is
$\sigma = \sqrt{\sigma^2}$
(measured in the original units).
33
EXPECTED VALUE
• The expected value or mean value of a
continuous random variable X with pdf f(x) is
$\mu = E(X) = \int_{\text{all } x} x f(x)\,dx$
• The variance of a continuous random
variable X with pdf f(x) is
$\sigma^2 = Var(X) = E[(X - \mu)^2] = \int_{\text{all } x} (x - \mu)^2 f(x)\,dx$
$= E(X^2) - \mu^2 = \int_{\text{all } x} x^2 f(x)\,dx - \mu^2$
34
EXAMPLE
• The pmf for the number of defective items in
a lot is as follows
0.35, x  0
0.39, x  1

p ( x)  0.19, x  2
0.06, x  3

0.01, x  4
Find the expected number and the variance of
defective items.
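One way to work this example is to apply the definitions of E(X) and Var(X) directly to the pmf. The sketch below does that in Python; the variable names are illustrative.

```python
# pmf of the number of defective items in a lot, as given in the example.
pmf = {0: 0.35, 1: 0.39, 2: 0.19, 3: 0.06, 4: 0.01}

# E(X) = sum of x * p(x)
mean = sum(x * p for x, p in pmf.items())

# Var(X) from the definition, sum of (x - mu)^2 * p(x) ...
variance = sum((x - mean) ** 2 * p for x, p in pmf.items())

# ... and from the shortcut E(X^2) - mu^2; the two should agree.
second_moment = sum(x ** 2 * p for x, p in pmf.items())
variance_shortcut = second_moment - mean ** 2
```

This gives an expected number of defectives of 0.99 and a variance of about 0.87.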
35
EXAMPLE
• Let X be a random variable. Its pdf is
f(x)=2(1-x), 0< x < 1
Find E(X) and Var(X).
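A possible worked solution, using the definitions of the mean and variance of a continuous random variable and the shortcut Var(X) = E(X²) − μ²:

```latex
\begin{align*}
E(X)   &= \int_0^1 x \cdot 2(1-x)\,dx
        = 2\left(\tfrac{1}{2} - \tfrac{1}{3}\right) = \tfrac{1}{3}, \\
E(X^2) &= \int_0^1 x^2 \cdot 2(1-x)\,dx
        = 2\left(\tfrac{1}{3} - \tfrac{1}{4}\right) = \tfrac{1}{6}, \\
Var(X) &= E(X^2) - [E(X)]^2
        = \tfrac{1}{6} - \tfrac{1}{9} = \tfrac{1}{18}.
\end{align*}
```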
36
Laws of Expected Value
• Let X be a rv and a, b, and c be constants.
Then, for any two functions g1(x) and g2(x)
whose expectations exist,
a) $E[a g_1(X) + b g_2(X) + c] = a E[g_1(X)] + b E[g_2(X)] + c$
b) If $g_1(x) \ge 0$ for all x, then $E[g_1(X)] \ge 0$.
c) If $g_1(x) \ge g_2(x)$ for all x, then $E[g_1(X)] \ge E[g_2(X)]$.
d) If $a \le g_1(x) \le b$ for all x, then $a \le E[g_1(X)] \le b$.
37
Laws of Expected Value and Variance
Let X be a rv and c be a constant.
Laws of Expected Value
• E(c) = c
• E(X + c) = E(X) + c
• E(cX) = cE(X)
Laws of Variance
• V(c) = 0
• V(X + c) = V(X)
• V(cX) = c²V(X)
38
EXPECTED VALUE
$E\left(\sum_{i=1}^{k} a_i X_i\right) = \sum_{i=1}^{k} a_i E(X_i).$
If X and Y are independent,
$E[g(X) h(Y)] = E[g(X)]\, E[h(Y)]$
The covariance of X and Y is defined as
$Cov(X, Y) = E[(X - E(X))(Y - E(Y))] = E(XY) - E(X)E(Y)$
39
EXPECTED VALUE
If X and Y are independent,
$Cov(X, Y) = 0$
The converse is not true in general: zero covariance does not
imply independence. It does hold under joint normality.
If (X,Y) is jointly normal, then X and Y are independent
iff
Cov(X,Y)=0
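A standard counterexample shows why zero covariance does not imply independence. In the sketch below, X is assumed uniform on {-1, 0, 1} and Y = X²; this particular pair is an illustration chosen here, not an example from the slides.

```python
from fractions import Fraction

# X is uniform on {-1, 0, 1}; Y = X^2 is completely determined by X.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}
joint = {(x, x * x): p for x, p in pmf_x.items()}

e_x = sum(x * p for (x, _), p in joint.items())
e_y = sum(y * p for (_, y), p in joint.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y  # Cov(X, Y) = E(XY) - E(X)E(Y)

# Independence would require f(x, y) = f_X(x) f_Y(y) for every pair (x, y).
pmf_y = {}
for (_, y), p in joint.items():
    pmf_y[y] = pmf_y.get(y, Fraction(0)) + p
independent = all(
    joint.get((x, y), Fraction(0)) == px * py
    for x, px in pmf_x.items()
    for y, py in pmf_y.items()
)
```

Here the covariance is exactly zero even though Y is a function of X, so the two variables are clearly dependent.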
40
EXPECTED VALUE
$Var(X_1 + X_2) = Var(X_1) + Var(X_2) + 2\,Cov(X_1, X_2)$
If X1 and X2 are independent,
$Var(X_1 + X_2) = Var(X_1) + Var(X_2)$
41
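CONDITIONAL EXPECTATION AND
VARIANCE
$$E(Y \mid x) = \begin{cases} \sum_y y f(y \mid x), & \text{if } X \text{ and } Y \text{ are discrete} \\ \int_{-\infty}^{\infty} y f(y \mid x)\,dy, & \text{if } X \text{ and } Y \text{ are continuous} \end{cases}$$
$Var(Y \mid x) = E(Y^2 \mid x) - [E(Y \mid x)]^2$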
42
CONDITIONAL EXPECTATION AND
VARIANCE
$E[E(Y \mid X)] = E(Y)$
$Var(Y) = E_X[Var(Y \mid X)] + Var_X[E(Y \mid X)]$
(EVVE rule)
Proofs available in Casella & Berger (1990), pgs. 154 & 158
43
Example
• An insect lays a large number of eggs, each
surviving with probability p. Consider a large
number of mothers. X: number of survivors in
a litter; Y: number of eggs laid
• Assume:
$X \mid Y \sim \text{Binomial}(Y, p)$
$Y \mid \Lambda \sim \text{Poisson}(\Lambda)$
$\Lambda \sim \text{Exponential}(\beta)$
• Find: expected number of survivors, i.e. E(X)
44
Example - solution
EX=E(E(X|Y))
=E(Yp)
=p E(Y)
=p E(E(Y|Λ))
=p E(Λ)
=pβ
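The result E(X) = pβ can be checked by simulating the three-level hierarchy. The sketch below is a rough Monte Carlo check under stated assumptions: Exponential(β) is taken to have mean β (as the solution's E(Λ) = β implies), the values p = 0.4 and β = 3 are arbitrary, and `sample_poisson` is a hand-written helper using Knuth's multiplication method.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw from Poisson(lam) by Knuth's method; adequate for small lam."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def simulate_mean_survivors(p, beta, n_mothers, seed=0):
    """Average number of survivors per mother over n_mothers simulated litters."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_mothers):
        lam = rng.expovariate(1.0 / beta)      # Lambda ~ Exponential, mean beta
        eggs = sample_poisson(lam, rng)        # Y | Lambda ~ Poisson(Lambda)
        total += sum(rng.random() < p for _ in range(eggs))  # X | Y ~ Binomial(Y, p)
    return total / n_mothers


estimate = simulate_mean_survivors(p=0.4, beta=3.0, n_mothers=100_000)
theoretical = 0.4 * 3.0  # p * beta
```

With this many simulated mothers the estimate should land close to pβ = 1.2, up to Monte Carlo noise.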
45
SOME MATHEMATICAL EXPECTATIONS
• Population Mean: $\mu = E(X)$
• Population Variance:
$\sigma^2 = Var(X) = E[(X - \mu)^2] = E(X^2) - \mu^2 \ge 0$
(measure of the deviation from the population mean)
• Population Standard Deviation: $\sigma = \sqrt{\sigma^2} \ge 0$
• Moments:
$\mu_k^* = E(X^k)$: the k-th moment
$\mu_k = E[(X - \mu)^k]$: the k-th central moment
46
SKEWNESS
• Measure of lack of symmetry in the pdf.
$$\text{Skewness} = \frac{E[(X - \mu)^3]}{\sigma^3} = \frac{\mu_3}{\mu_2^{3/2}}$$
If the distribution of X is symmetric around its
mean μ,
$\mu_3 = 0 \Rightarrow$ Skewness = 0
47
KURTOSIS
• Measure of the peakedness of the pdf. Describes the
shape of the distribution.
$$\text{Kurtosis} = \frac{E[(X - \mu)^4]}{\sigma^4} = \frac{\mu_4}{\mu_2^2}$$
Kurtosis = 3 → Normal
Kurtosis > 3 → Leptokurtic
(peaked and fat tails)
Kurtosis < 3 → Platykurtic
(less peaked and thinner tails)
48
KURTOSIS
• What is the range of kurtosis?
• Claim: Kurtosis ≥ 1. Why?
• Proof:
$Var(Y) = E(Y^2) - (EY)^2$
Let $Y = (X - \mu)^2$.
$E[(X - \mu)^4] = Var((X - \mu)^2) + [E((X - \mu)^2)]^2 = Var((X - \mu)^2) + \sigma^4$
$$\text{Kurtosis} = \frac{Var((X - \mu)^2)}{\sigma^4} + 1 \ge 1$$
Measures of Central Location
• Usually, we focus our attention on two
types of measures when describing
population characteristics:
– Central location
– Variability or spread
50
Measures of Central Location
• The measure of central location reflects
the locations of all the data points.
• How?
– With one data point, clearly the central location is at the point itself.
– With two data points, the central location should fall in the middle
between them (in order to reflect the location of both of them).
– But if the third data point appears on the left hand-side of the midrange,
it should “pull” the central location to the left.
51
The Arithmetic Mean
• This is the most popular measure of central
location
Mean = (Sum of the observations) / (Number of observations)
52
The Arithmetic Mean
Sample mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$, where n is the sample size.
Population mean: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the population size.
53
The Arithmetic Mean
• Example
The reported times on the Internet of 10 adults are 0, 7, 12, 5, 33,
14, 8, 0, 9, 22 hours. Find the mean time on the Internet.
$$\bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{x_1 + x_2 + \ldots + x_{10}}{10} = \frac{0 + 7 + \ldots + 22}{10} = 11.0$$
54
The Arithmetic Mean
• Drawback of the mean:
It can be influenced by unusual
observations, because it uses all the
information in the data set.
55
The Median
• The Median of a set of observations is the value
that falls in the middle when the observations are
arranged in order of magnitude. It divides the
data in half.
Example
Find the median of the time on the internet
for the 10 adults of the previous example.
Even number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22, 33
The median is the average of the two middle values: (8 + 9)/2 = 8.5
Comment
Suppose only 9 adults were sampled
(exclude, say, the longest time (33)).
Odd number of observations:
0, 0, 5, 7, 8, 9, 12, 14, 22
The median is the middle value: 8
56
The Median
• Depth of median = (n+1)/2
$$\text{Median} = \begin{cases} X_{((n+1)/2)}, & \text{if } n \text{ is odd} \\ \dfrac{X_{(k)} + X_{(k+1)}}{2}, & \text{if } n \text{ is even } (n = 2k) \end{cases}$$
57
The Mode
• The Mode of a set of observations is the value that
occurs most frequently.
• A set of data may have one mode (or modal class), or
two or more modes.
58
The Mode
• Find the mode for the data in the Example. Here are
the data again: 0, 7, 12, 5, 33, 14, 8, 0, 9, 22
Solution
• All observations except “0” occur once. There are two “0”s. Thus,
the mode is zero.
• Is this a good measure of central location?
• The value “0” does not reside at the center of this set
(compare with the mean = 11.0 and the median = 8.5).
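The three measures of central location for the Internet-time data can be computed with Python's standard `statistics` module, as a quick check of the values quoted above.

```python
import statistics

# Reported hours on the Internet for the 10 adults in the example.
hours = [0, 7, 12, 5, 33, 14, 8, 0, 9, 22]

mean_hours = statistics.mean(hours)
median_hours = statistics.median(hours)   # averages the two middle values
mode_hours = statistics.mode(hours)       # most frequent value

# Dropping the longest time (33) leaves an odd number of observations.
median_nine = statistics.median(sorted(hours)[:-1])
```

The mean (11.0) is pulled upward by the single large value 33, while the median (8.5) is not, and the mode (0) sits at the edge of the data.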
59
Relationship among Mean, Median, and Mode
• If a distribution is symmetrical and bell shaped,
the mean, median and mode coincide
Mean = Median = Mode
• If a distribution is asymmetrical, and skewed
to the left or to the right, the three measures
differ.
A positively skewed distribution
(“skewed to the right”)
Mode < Median < Mean
60
Relationship among Mean, Median, and
Mode
• If a distribution is asymmetrical, and
skewed to the left or to the right, the three
measures differ.
A positively skewed distribution
(“skewed to the right”)
Mode < Median < Mean
A negatively skewed distribution
(“skewed to the left”)
Mean < Median < Mode
61
Measures of variability
• Measures of central location fail to tell the
whole story about the distribution.
• A question of interest still remains unanswered:
How much are the observations spread out
around the mean value?
62
Measures of variability
Observe two hypothetical
data sets:
Small variability
The average value provides
a good representation of the
observations in the data set.
This data set is now
changing to...
63
Measures of Variability
Larger variability
The same average value does not
provide as good a representation of the
observations in the data set as before.
64
The Range
– The range of a set of observations is the difference
between the largest and smallest observations.
– Its major advantage is the ease with which it can be
computed.
– Its major shortcoming is its failure to provide information
on the dispersion of the observations between the two
end points.
But how are all the observations spread out between the smallest
and the largest observation?
The range cannot assist in answering this question.
65
The Variance
• This measure reflects the dispersion of all the
observations.
• The variance of a population of size N, x1, x2,…,xN,
whose mean is μ is defined as
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
• The variance of a sample of n observations
x1, x2, …,xn whose mean is $\bar{x}$ is defined as
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
66
Why not use the sum of deviations?
Consider two small populations:
A: 8, 9, 10, 11, 12
B: 4, 7, 10, 13, 16
The mean of both populations is 10, but the measurements in B
are more dispersed than those in A. A measure of dispersion
should agree with this observation.
Can the sum of deviations be a good measure of dispersion?
A: 8-10 = -2, 9-10 = -1, 10-10 = 0, 11-10 = +1, 12-10 = +2; Sum = 0
B: 4-10 = -6, 7-10 = -3, 10-10 = 0, 13-10 = +3, 16-10 = +6; Sum = 0
The sum of deviations is zero for both populations and
therefore is not a good measure of dispersion.
67
The Variance
Let us calculate the variance of the two populations
$$\sigma_A^2 = \frac{(8-10)^2 + (9-10)^2 + (10-10)^2 + (11-10)^2 + (12-10)^2}{5} = 2$$
$$\sigma_B^2 = \frac{(4-10)^2 + (7-10)^2 + (10-10)^2 + (13-10)^2 + (16-10)^2}{5} = 18$$
Why is the variance defined as
the average squared deviation?
Why not use the sum of squared
deviations as a measure of
variation instead?
After all, the sum of squared
deviations increases in
magnitude when the variation
of a data set increases!!
68
The Variance
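Let us calculate the sum of squared deviations for both data sets.
Which data set has a larger dispersion?
Data set B is more dispersed around the mean.
[Figure: data sets A and B on number lines; A spans 1 to 3 around a mean of 2, B spans 1 to 5 around a mean of 3]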
69
The Variance
$Sum_A = (1-2)^2 + \ldots + (1-2)^2 + (3-2)^2 + \ldots + (3-2)^2 = 10$
$Sum_B = (1-3)^2 + (5-3)^2 = 8$
$Sum_A > Sum_B$. This is inconsistent with the
observation that set B is more dispersed.
70
The Variance
However, when calculated on a “per observation”
basis (variance), the data set dispersions are
properly ranked.
$\sigma_A^2 = Sum_A/N = 10/5 = 2$
$\sigma_B^2 = Sum_B/N = 8/2 = 4$
71
The Variance
• Example
– The following sample consists of the number of
jobs six students applied for: 17, 15, 23, 7, 9,
13. Find its mean and variance
• Solution
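$$\bar{x} = \frac{\sum_{i=1}^{6} x_i}{6} = \frac{17 + 15 + 23 + 7 + 9 + 13}{6} = \frac{84}{6} = 14 \text{ jobs}$$
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{6 - 1}\left[(17-14)^2 + (15-14)^2 + \ldots + (13-14)^2\right] = 33.2 \text{ jobs}^2$$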
72
The Variance – Shortcut method
$$s^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]$$
$$= \frac{1}{6-1}\left[17^2 + 15^2 + \ldots + 13^2 - \frac{(17 + 15 + \ldots + 13)^2}{6}\right] = 33.2 \text{ jobs}^2$$
73
Standard Deviation
• The standard deviation of a set of
observations is the square root of the
variance.
Sample standard deviation: $s = \sqrt{s^2}$
Population standard deviation: $\sigma = \sqrt{\sigma^2}$
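For the job-application sample (17, 15, 23, 7, 9, 13), the definition of the sample variance, the shortcut formula, and the standard deviation can be compared side by side. A minimal sketch with illustrative variable names:

```python
import math

# Number of jobs six students applied for.
jobs = [17, 15, 23, 7, 9, 13]
n = len(jobs)
mean = sum(jobs) / n

# Definition: sum of squared deviations from the mean, divided by n - 1.
s2_definition = sum((x - mean) ** 2 for x in jobs) / (n - 1)

# Shortcut: (sum of squares - (sum)^2 / n) / (n - 1).
s2_shortcut = (sum(x * x for x in jobs) - sum(jobs) ** 2 / n) / (n - 1)

# Standard deviation: square root of the variance, in the original units (jobs).
s = math.sqrt(s2_definition)
```

Both formulas give 33.2 jobs², so the sample standard deviation is about 5.76 jobs.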
74
Standard Deviation
• Example
– To examine the consistency of shots for a new
innovative golf club, a golfer was asked to hit 150
shots, 75 with a currently used (7-iron) club, and
75 with the new club.
– The distances were recorded.
– Which club is better?
75
Standard Deviation
• Example – solution
Excel printout, from the
“Descriptive Statistics” submenu.

Statistic             Current      Innovation
Mean                  150.5467     150.1467
Standard Error        0.668815     0.357011
Median                151          150
Mode                  150          149
Standard Deviation    5.792104     3.091808
Sample Variance       33.54847     9.559279
Kurtosis              0.12674      -0.88542
Skewness              -0.42989     0.177338
Range                 28           12
Minimum               134          144
Maximum               162          156
Sum                   11291        11261
Count                 75           75

The innovation club is more consistent (smaller standard
deviation), and because the means are close, it is
considered the better club.
76
Interpreting Standard Deviation
• The standard deviation can be used to
– compare the variability of several distributions
– make a statement about the general shape of a
distribution.
• The empirical rule: If a sample of observations
has a mound-shaped distribution, the interval
– $(\bar{x} - s, \bar{x} + s)$ contains approximately 68% of the measurements
– $(\bar{x} - 2s, \bar{x} + 2s)$ contains approximately 95% of the measurements
– $(\bar{x} - 3s, \bar{x} + 3s)$ contains approximately 99.7% of the measurements
77
Interpreting Standard Deviation
• Example
A practitioner wants to describe the way
returns on investment are distributed.
– The mean return = 10%
– The standard deviation of the return = 8%
– The histogram is bell shaped.
78
Interpreting Standard Deviation
Example – solution
• The empirical rule can be applied (bell shaped
histogram)
• Describing the return distribution
– Approximately 68% of the returns lie between 2% and 18%
[10 – 1(8), 10 + 1(8)]
– Approximately 95% of the returns lie between -6% and 26%
[10 – 2(8), 10 + 2(8)]
– Approximately 99.7% of the returns lie between -14% and
34%
[10 – 3(8), 10 + 3(8)]
79
Chebyshev’s Theorem
• For any value of k ≥ 1, at least 100(1 − 1/k²)% of the
data lie within the interval from $\bar{x} - ks$ to $\bar{x} + ks$.
• This theorem is valid for any set of measurements
(sample, population) of any shape!!

k   Interval                           Chebyshev                   Empirical Rule
1   $(\bar{x} - s, \bar{x} + s)$       at least 0%  (1 − 1/1²)     approximately 68%
2   $(\bar{x} - 2s, \bar{x} + 2s)$     at least 75% (1 − 1/2²)     approximately 95%
3   $(\bar{x} - 3s, \bar{x} + 3s)$     at least 89% (1 − 1/3²)     approximately 99.7%
80
Chebyshev’s Theorem
• Example
– The annual salaries of the employees of a chain of computer
stores produced a positively skewed histogram. The mean and
standard deviation are $28,000 and $3,000, respectively. What
can you say about the salaries at this chain?
Solution
At least 75% of the salaries lie between $22,000 and $34,000:
[28000 – 2(3000), 28000 + 2(3000)]
At least 88.9% of the salaries lie between $19,000 and $37,000:
[28000 – 3(3000), 28000 + 3(3000)]
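The bounds in the salary example follow mechanically from the theorem. The sketch below wraps the two calculations in small helper functions; the names `chebyshev_bound` and `interval` are illustrative.

```python
def chebyshev_bound(k):
    """Minimum fraction of data within k standard deviations of the mean (k >= 1)."""
    if k < 1:
        raise ValueError("Chebyshev's bound is stated here for k >= 1")
    return 1 - 1 / k ** 2


def interval(mean, sd, k):
    """Interval from mean - k*sd to mean + k*sd."""
    return mean - k * sd, mean + k * sd


# Salary example: mean $28,000, standard deviation $3,000.
two_sd = (chebyshev_bound(2), interval(28_000, 3_000, 2))
three_sd = (chebyshev_bound(3), interval(28_000, 3_000, 3))
```

Unlike the empirical rule, these guarantees do not depend on the histogram being mound-shaped, which is why they apply to the skewed salary data.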
81
The Coefficient of Variation
• The coefficient of variation of a set of measurements
is the standard deviation divided by the mean value.
Sample coefficient of variation: $cv = \frac{s}{\bar{x}}$
Population coefficient of variation: $CV = \frac{\sigma}{\mu}$
• This coefficient provides a proportionate measure of
variation.
A standard deviation of 10 may be perceived
large when the mean value is 100, but only
moderately large when the mean value is 500
82
Percentiles
• Example from http://www.ehow.com/how_2310404_calculate-percentiles.html
• Your test score, e.g. 70%, tells you how many
questions you answered correctly. However, it
doesn’t tell how well you did compared to the other
people who took the same test.
• If the percentile of your score is 75, then you scored
higher than 75% of other people who took the test.
83
Sample Percentiles and Box Plots
• Percentile
– The pth percentile of a set of measurements is the
value for which
• p percent of the observations are less than that value
• (100 − p) percent of all the observations are greater than
that value.
84
Sample Percentiles
• Find the 10th percentile of 6 8 3 6 2 8 1
• Order the data: 1 2 3 6 6 8 8
• 7*(0.10) = 0.70; round up to 1
The first observation, 1, is the 10th percentile.
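The round-up rule used in this example can be sketched as a small function. This follows only the rule shown here (position = n·p/100 rounded up); other percentile conventions exist and may give different answers. The function name and the integer-percent argument are assumptions made for illustration.

```python
def percentile_round_up(data, p):
    """Return the p-th percentile (p an integer percent) using the round-up rule.

    The position is ceil(n * p / 100), computed with integer arithmetic to
    avoid floating-point surprises; position 1 is the smallest observation.
    """
    ordered = sorted(data)
    n = len(ordered)
    position = -(-n * p // 100)   # ceiling of n * p / 100
    position = max(position, 1)   # guard for p = 0
    return ordered[position - 1]


scores = [6, 8, 3, 6, 2, 8, 1]
tenth = percentile_round_up(scores, 10)
fiftieth = percentile_round_up(scores, 50)
```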
85
• Commonly used percentiles
– First (lower) quartile, Q1 = 25th percentile
– Second (middle) quartile, Q2 = 50th percentile
– Third quartile, Q3 = 75th percentile
– Fourth quartile, Q4 = 100th percentile
– First (lower) decile = 10th percentile
– Ninth (upper) decile = 90th percentile
86
Quartiles and Variability
• Quartiles can provide an idea about the shape
of a histogram
[Figure: a positively skewed histogram, where Q2 − Q1 is smaller than Q3 − Q2, and a negatively skewed histogram, where Q2 − Q1 is larger than Q3 − Q2]
87
Interquartile Range
• Large value indicates a large spread of the
observations
Interquartile range = Q3 – Q1
88
Box Plot
– This is a pictorial display that provides the main
descriptive measures of the data set:
• L - The largest observation
• Q3 - The upper quartile
• Q2 - The median
• Q1 - The lower quartile
• S - The smallest observation
[Figure: a box from Q1 to Q3 with a line at Q2; whiskers extend from the box toward S and L, reaching at most 1.5(Q3 – Q1) beyond each end of the box]
89
Box Plot
– The following data give noise levels measured at 36
different times directly outside of Grand Central
Station in Manhattan.
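NOISE: 82, 89, 94, 110, ...
Smallest = 60
Q1 = 75
Median = 90
Q3 = 107
Largest = 125
IQR = 32
75 - 1.5(IQR) = 27
107 + 1.5(IQR) = 155
Outliers = none (no observation lies below 27 or above 155)
[Figure: box plot on an axis from 60 to 130, with the box from 75 to 107 and the median at 90]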
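The outlier limits for the noise data follow from the quartiles alone. The sketch below uses only the summary values given above (the full 36 observations are not listed); the function name `box_plot_fences` is an illustrative choice.

```python
def box_plot_fences(q1, q3, k=1.5):
    """Return (lower, upper) limits: Q1 - k*IQR and Q3 + k*IQR."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


lower, upper = box_plot_fences(75, 107)

# Only the extremes are known from the summary; if neither lies outside the
# limits, no observation in between can be an outlier either.
extremes = [60, 125]
outliers = [value for value in extremes if value < lower or value > upper]
```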
Box Plot
NOISE - continued
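[Figure: box plot with 25% of the data between the smallest value 60 and Q1 = 75, 50% between Q1 = 75 and Q3 = 107 (median Q2 = 90), and 25% between Q3 = 107 and the largest value 125]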
– Interpreting the box plot results
• The scores range from 60 to 125.
• About half the scores are smaller than 90, and about half are larger
than 90.
• About half the scores lie between 75 and 107.
• About a quarter lies below 75 and a quarter above 107.
• Data is slightly positively skewed.
91
Box Plot
Example: A study was organized to compare the service time in
5 drive-through restaurants.
[Figure: side-by-side box plots of service time (axis from 100 to 300) for Popeyes (1), Wendy’s (2), McDonalds (3), Hardee’s (4) and Jack in the Box (5)]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the
shortest on average and most consistent.
92
Box Plot
[Figure: the same side-by-side box plots of service time, annotated with “Times are symmetric” and “Times are positively skewed”]
• Jack in the Box is the slowest in service.
• Hardee’s service time variability is the largest.
• Wendy’s service time appears to be the
shortest and most consistent.
93
Paired Data Sets and the Sample
Correlation Coefficient
• The covariance and the coefficient of
correlation are used to measure the direction
and strength of the linear relationship
between two variables.
– Covariance - is there any pattern to the way two
variables move together?
– Coefficient of correlation - how strong is the linear
relationship between two variables?
94
Covariance
Population covariance: $COV(X, Y) = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
$\mu_x$ ($\mu_y$) is the population mean of the variable X (Y).
N is the population size.
Sample covariance: $cov(x, y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
$\bar{x}$ ($\bar{y}$) is the sample mean of the variable X (Y).
n is the sample size.
95
Covariance
• If the two variables move in the same
direction, (both increase or both decrease),
the covariance is a large positive number.
• If the two variables move in opposite
directions, (one increases when the other
one decreases), the covariance is a large
negative number.
• If the two variables are unrelated, the
covariance will be close to zero.
96
Covariance
• Compare the following three sets
Set 1:
xi    yi    (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2     13    -3         -7         21
6     20    1          0          0
7     27    2          7          14
x̄ = 5, ȳ = 20, Cov(x,y) = 17.5

Set 2:
xi    yi    (x – x̄)   (y – ȳ)   (x – x̄)(y – ȳ)
2     27    -3         7          -21
6     20    1          0          0
7     13    2          -7         -14
x̄ = 5, ȳ = 20, Cov(x,y) = -17.5

Set 3:
xi    yi
2     20
6     27
7     13
x̄ = 5, ȳ = 20, Cov(x,y) = -3.5
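The three covariances can be reproduced with the sample covariance formula (divisor n − 1). A minimal sketch; the function name `sample_cov` is illustrative.

```python
def sample_cov(xs, ys):
    """Sample covariance: sum of (x - x_bar)(y - y_bar), divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


xs = [2, 6, 7]
cov_set1 = sample_cov(xs, [13, 20, 27])  # y increases with x
cov_set2 = sample_cov(xs, [27, 20, 13])  # y decreases as x increases
cov_set3 = sample_cov(xs, [20, 27, 13])  # no clear pattern
```

The sign of the covariance tracks the direction of the relationship, and the third set, with no clear pattern, gives a value much closer to zero.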
97
The coefficient of correlation
Population coefficient of correlation: $\rho = \frac{COV(X, Y)}{\sigma_x \sigma_y}$
Sample coefficient of correlation: $r = \frac{cov(X, Y)}{s_x s_y}$
– This coefficient answers the question: How
strong is the association between X and Y?
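Dividing the covariance by the two standard deviations rescales it to lie between −1 and +1. The sketch below applies the sample formula to small illustrative data sets with x = 2, 6, 7; the function names are assumptions.

```python
import math


def sample_cov(xs, ys):
    """Sample covariance with divisor n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def sample_corr(xs, ys):
    """r = cov(x, y) / (s_x * s_y); note that s_x^2 = cov(x, x)."""
    s_x = math.sqrt(sample_cov(xs, xs))
    s_y = math.sqrt(sample_cov(ys, ys))
    return sample_cov(xs, ys) / (s_x * s_y)


xs = [2, 6, 7]
r_increasing = sample_corr(xs, [13, 20, 27])
r_decreasing = sample_corr(xs, [27, 20, 13])
r_mixed = sample_corr(xs, [20, 27, 13])
```

Unlike the covariance, r does not depend on the units of X and Y, which is what makes it usable as a measure of strength.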
98
The coefficient of correlation
ρ or r =
+1: Strong positive linear relationship (COV(X,Y) > 0)
0: No linear relationship (COV(X,Y) = 0)
-1: Strong negative linear relationship (COV(X,Y) < 0)
99
The Coefficient of Correlation
• If the two variables are very strongly positively
related, the coefficient value is close to +1
(strong positive linear relationship).
• If the two variables are very strongly negatively
related, the coefficient value is close to -1
(strong negative linear relationship).
• No straight line relationship is indicated by a
coefficient close to zero.
100
Correlation and causation
• Recognize the difference between correlation and
causation — just because two things occur together,
that does not necessarily mean that one causes the
other.
• For random processes, causation means that if A
occurs, that causes a change in the probability that B
occurs.
102
Correlation and causation
• The existence of a statistical relationship, no matter how strong it
is, does not imply a cause-and-effect relationship between X
and Y. For example, let X be size of vocabulary, and Y be writing
speed for a group of children. There will most probably be a
positive relationship, but this does not imply that an increase
in vocabulary causes an increase in the speed of writing.
Other variables such as age, education, etc. will affect both X
and Y.
• Even if there is a causal relationship between X and Y, it might
be in the opposite direction, i.e. from Y to X. For example, let X be
the thermometer reading and let Y be the actual temperature. Here Y
will affect X.
103
Example
Dr. Leonard Eron, professor at the University of Illinois at Chicago, has
conducted a longitudinal study of the long-term effects of violent
television programming. In 1960, he asked 870 third grade children
about their favorite television shows. He found that the children judged most
violent by their peers also watched the most violent television. Dr.
Eron noted, however, that it was not clear which came first: the
child’s behavior or the influence of television.
In follow-up interviews at ten-year intervals, Eron found that
youngsters who at age eight were nonaggressive but were watching
violent television were more aggressive than children who at age
eight were aggressive and watched non-violent television. Eron
claims that this establishes a cause-and-effect relationship
between watching violent television and aggressive behavior.
Can you think of any other possible causes?
104
Example - solution
• It could be that the difference in aggressive
behavior is due to other familial influences.
Perhaps children who are permitted to watch
violent programming are more likely to come
from violent or abusive families, which could
also lead to more aggressive behavior.
105