Statistics
© Copyright 2001, Alan Marshall
• Branch of Mathematics that deals with the collection and analysis of data
• Descriptive Statistics: used to analyze and describe data
• Inferential Statistics: uses that information to make statements about the relationships between variables or expectations about future events
Measures of Central Tendency
• Arithmetic Mean
• Median
• Mode
• Geometric Mean
Arithmetic Mean
• Other names
  • Average
  • Mean
\[ \text{Mean} = \frac{\text{Sum of the measurements}}{\text{Number of measurements}} \]
• Sample: \[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• Population: \[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} \]
• The calculation is identical; only the notation varies slightly
Summation Notation
\[ \sum\nolimits_{t=1}^{N} x_t \qquad\qquad \sum\limits_{t=1}^{N} x_t \]
• Notice that the first form uses less vertical space on the page
  • This makes accountants very happy
  • The first can also be easier to fit into a line of text
Example
• Ten second-year BBA students wrote the CSC exam last month
• Their scores were: 71, 72, 88, 69, 77, 63, 91, 81, 83, 75
\[ \bar{x} = \frac{71 + 72 + 88 + 69 + 77 + 63 + 91 + 81 + 83 + 75}{10} = 77 \]
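The calculation above can be sketched in a few lines of Python. This is only an illustration; the names `scores` and `arithmetic_mean` are made up for this example and do not come from the slides.

```python
# Exam scores from the example above.
scores = [71, 72, 88, 69, 77, 63, 91, 81, 83, 75]

def arithmetic_mean(values):
    """Sum the observations and divide by the number of observations."""
    return sum(values) / len(values)

mean_score = arithmetic_mean(scores)
print(mean_score)  # 77.0
```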
Calculating the Mean
• Arithmetic mean
  • Sum the observations and divide by the number of observations
• Example: 5%, 7%, -2%, 12%, 8%
\[ \bar{r} = \frac{\sum_{t=1}^{N} r_t}{N} = \frac{30\%}{5} = 6\% \]
Problem with the Arithmetic Mean
• The arithmetic mean is incorrect for variables that are related multiplicatively, like rates of growth, rates of return and rates of change
• $1,000 at 6% for 5 years should be $1,338.23: \( (1000)(1.06)^5 = 1338.23 \)
• Compounding the five actual returns, however, ends at only $1,331.81:
Starting Value   Rate of Return   Ending Value
1000.00           5%              1050.00
1050.00           7%              1123.50
1123.50          -2%              1101.03
1101.03          12%              1233.15
1233.15           8%              1331.81
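The gap between the two ending values can be checked with a small sketch. The function name `compound` and the variable names are assumptions for illustration only; the returns and starting value are those in the table.

```python
# Yearly returns and starting value from the table above.
returns = [0.05, 0.07, -0.02, 0.12, 0.08]
start = 1000.00

def compound(value, rates):
    """Grow a starting value through a sequence of periodic returns."""
    for r in rates:
        value *= 1 + r
    return value

# Compounding the actual returns, year by year.
actual_end = compound(start, returns)

# Compounding the arithmetic mean return (6%) for the same five years.
arithmetic_mean = sum(returns) / len(returns)
mean_based_end = compound(start, [arithmetic_mean] * len(returns))

print(round(actual_end, 2))      # 1331.81
print(round(mean_based_end, 2))  # 1338.23
```

The arithmetic mean overstates the ending value because it ignores the order-independent but multiplicative way the returns combine.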
Geometric Mean
• The Geometric Mean should be used for rates of change, like rates of return
\[ \bar{r}_G = \left[ \prod_{t=1}^{N} (1 + r_t) \right]^{1/N} - 1 \]
• \( \prod \) means: the product of these factors from 1 to N
\[ \bar{r}_G = \left[ (1.05)(1.07)(0.98)(1.12)(1.08) \right]^{1/5} - 1 = (1.3318059)^{0.2} - 1 = 5.898\% \]
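A minimal sketch of the geometric mean calculation, assuming Python 3.8 or later for `math.prod`. The function name `geometric_mean_return` is hypothetical.

```python
import math

# Yearly returns from the example.
returns = [0.05, 0.07, -0.02, 0.12, 0.08]

def geometric_mean_return(rates):
    """Take the N-th root of the product of (1 + r_t), then subtract 1."""
    growth = math.prod(1 + r for r in rates)
    return growth ** (1 / len(rates)) - 1

g = geometric_mean_return(returns)
print(f"{g:.3%}")  # 5.898%

# Compounding $1,000 at the geometric mean recovers the actual ending value.
print(round(1000 * (1 + g) ** 5, 2))  # 1331.81
```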
Geometric vs. Arithmetic Mean
• The more variable the underlying data, the greater the error from using the Arithmetic mean
• The Geometric Mean is often easier to calculate, because only the starting and ending values are needed:
  • Stock prices: 1992: $20; 1999: $40, so \( R = (40/20)^{1/7} - 1 = 10.41\% \)
Geometric vs. Arithmetic Mean
• For analysis of past performance, use the Geometric mean
  • The past returns have averaged 5.898%
• To use the past returns to estimate the future expected return, use the Arithmetic mean
  • The expected return is 6%
Median and Mode
• Median: Midpoint
  • If odd number of observations: Middle observation
  • If even number of observations: Average of middle 2 observations
• Mode: Most frequent
Example
• Our CSC mark data was (sorted): 63, 69, 71, 72, 75, 77, 81, 83, 88, 91
• The median is 76, the average of the middle two observations (75 and 77)
• There is no mode, because no score occurs more than once
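The median and mode for this example can be checked as follows. `statistics.median` is from the Python standard library; the helper `modes` is a hypothetical function written for this sketch, which treats data with no repeated value as having no mode, matching the slide.

```python
import statistics
from collections import Counter

scores = [71, 72, 88, 69, 77, 63, 91, 81, 83, 75]

# With an even number of observations, the median averages the middle two.
median_score = statistics.median(scores)

def modes(values):
    """Return the most frequent value(s), or an empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)

print(median_score)   # 76.0
print(modes(scores))  # []
```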
Example
         x   Deviation
        71      -6
        72      -5
        88      11
        69      -8
        77       0
        63     -14
        91      14
        81       4
        83       6
        75      -2
Mean    77       0
• The Deviation is the difference between each observation and the mean
• The sign indicates whether the observation is above (+) or below (-) the mean
• The average deviation is always zero
• If it isn’t, you must have made a mistake!
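The zero-sum property of the deviations is easy to confirm with a short sketch (variable names are illustrative only):

```python
scores = [71, 72, 88, 69, 77, 63, 91, 81, 83, 75]
mean_score = sum(scores) / len(scores)

# Deviation of each observation from the mean: positive above, negative below.
deviations = [x - mean_score for x in scores]

print(deviations)
print(sum(deviations))  # 0.0
```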
Measures of Dispersion
• So far, we have looked at measures of central tendency
• What about measuring the tendency of the data to vary from these centres?
Measures of Dispersion
• Range
  • Highest - Lowest
• Variance
• Standard Deviation
Example
• For the same ten scores, the range is 91 - 63 = 28
• The range can be extremely sensitive to outlier observations
• Suppose one of these students had a very bad day and scored 8
  • The range would now be 91 - 8 = 83
Mean Absolute Deviation
         x   Deviation   |D|
        71      -6         6
        72      -5         5
        88      11        11
        69      -8         8
        77       0         0
        63     -14        14
        91      14        14
        81       4         4
        83       6         6
        75      -2         2
Mean    77       0         7
• The Mean Absolute Deviation is a measure of average dispersion that is not used very much
• It has some undesirable mathematical properties beyond the level of this course
Mean Squared Deviation
         x   Deviation   D^2
        71      -6        36
        72      -5        25
        88      11       121
        69      -8        64
        77       0         0
        63     -14       196
        91      14       196
        81       4        16
        83       6        36
        75      -2         4
Mean    77
Sum              0       694
• The Mean Squared Deviation is very commonly used
• The MSD in this example is 694/10 = 69.4
• The more common name of the MSD is the VARIANCE
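Both dispersion measures from the last two tables can be reproduced with a short sketch. The names `mad` and `msd` are shorthand chosen here, not notation from the slides.

```python
scores = [71, 72, 88, 69, 77, 63, 91, 81, 83, 75]
n = len(scores)
mean_score = sum(scores) / n

# Mean Absolute Deviation: average of |x - mean|.
mad = sum(abs(x - mean_score) for x in scores) / n

# Mean Squared Deviation: average of (x - mean)^2, dividing by the count.
msd = sum((x - mean_score) ** 2 for x in scores) / n

print(mad)  # 7.0
print(msd)  # 69.4
```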
Variance
• Variance measures the amount of dispersion from the mean
• For Populations: \[ \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \]
• For Samples: \[ \hat{\sigma}^2 = s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
Standard Deviation
• Standard Deviation, the square root of the variance, measures the amount of dispersion from the mean
• For Populations: \[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]
• For Samples: \[ \hat{\sigma} = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \]
Standard Deviation Example
• Using the previous example
• The data is sample data
\[ \hat{\sigma} = s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{694}{10 - 1}} = 8.78 \]
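As a cross-check, Python's standard `statistics` module distinguishes the sample and population versions in the same way: `variance` and `stdev` divide by n - 1, while `pvariance` divides by N. The sketch below is illustrative only.

```python
import statistics

scores = [71, 72, 88, 69, 77, 63, 91, 81, 83, 75]

sample_var = statistics.variance(scores)   # sum of squared deviations / (n - 1)
sample_sd = statistics.stdev(scores)       # square root of the sample variance
pop_var = statistics.pvariance(scores)     # sum of squared deviations / N

print(round(sample_var, 2))  # 77.11
print(round(sample_sd, 2))   # 8.78
print(round(pop_var, 2))     # 69.4
```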
Interpreting the Std. Dev.
• You have heard of the Bell Shaped or Normal Distribution
• The properties of the Normal Distribution are well known and give us the EMPIRICAL RULE
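Normal Distribution
[Figure: bell curve of the Normal Distribution. Horizontal axis: Z Value, Standard Deviations from Mean, from -4 to 4. About 2/3 of the area lies within 1 standard deviation of the mean, 95% within 2 (a 2.5% tail on each side), and 99.7% within 3 (a 0.15% tail on each side).]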
Empirical Rule
For approximately Normally Distributed data:
• Within 1σ of the mean: approx. 2/3
• Within 2σ of the mean: approx. 95% (19/20)
• Within 3σ of the mean: virtually all
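The rule-of-thumb figures can be compared with the exact normal probabilities. The sketch below uses the standard identity that the probability of falling within k standard deviations of the mean is erf(k / sqrt(2)); the function name `within_k_sigma` is made up for this illustration.

```python
import math

def within_k_sigma(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(within_k_sigma(k), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```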
Quartiles, Percentiles, etc.
• The Median splits the data in half
• Quartiles split the data into quarters
• Deciles split the data into tenths
• Percentiles split the data into one-hundredths
Rank Measures
• “That was a top-half performance”
• “WTG Special fund has been a top quartile performer for the past 5 years”
• “Our programme accepts only students proven to be top decile performers”
• “I was in the 92nd percentile on the GMAT”
Using Excel
• Full Descriptive Statistics
  • Tools > Data Analysis > Descriptive Statistics
Measures of Association
Bivariate Statistics
• So far, we have been dealing with statistics of individual variables
• We also have statistics that relate pairs of variables
Interactions
Sometimes two variables appear related:
• smoking and lung cancers
• height and weight
• years of education and income
• engine size and gas mileage
• GMAT scores and MBA GPA
• house size and price
Interactions
• Some of these variables would appear to be positively related and others negatively
• If these were related, we would expect to be able to derive a linear relationship:
y = a + bx
• where
  • b is the slope, and
  • a is the intercept
Linear Relationships
• We will be deriving linear relationships from bivariate (two-variable) data
• Our symbols will be:
\[ y = \beta_0 + \beta_1 x + \varepsilon \quad \text{or} \quad \hat{y} = \hat{\beta}_0 + \hat{\beta}_1 x \]
\[ \hat{\beta}_1 = \text{Slope}, \qquad \hat{\beta}_0 = \text{Intercept}, \qquad \varepsilon = \text{Error term} \]
Example
• Consider the following example comparing the returns of Consolidated Moose Pasture stock (CMP) and the TSE 300 Index
• The next slide shows 25 monthly returns
Example Data
TSE  CMP     TSE  CMP     TSE  CMP
  x    y       x    y       x    y
  3    4      -4   -3       2    4
 -1   -2      -1    0      -1    1
  2   -2       0   -2       4    3
  4    2       1    0      -2   -1
  5    3       0    0       1    2
 -3   -5      -3    1      -3   -4
 -5   -2      -3   -2       2    1
  1    2       1    3      -2   -2
  2   -1
Example
• From the data, it appears that a positive relationship may exist
  • Most of the time when the TSE is up, CMP is up
  • Likewise, when the TSE is down, CMP is down most of the time
  • Sometimes, they move in opposite directions
• Let’s graph this data
Graph Of Data
[Figure: scatter plot of the 25 monthly returns, with TSE on the horizontal axis and CMP on the vertical axis, both running from -6 to 6.]
Example Summary Statistics
• The data do appear to be positively related
• Let’s derive some summary statistics about these data:
        Mean    s^2     s
CMP     0.00    6.25    2.50
TSE     0.00    7.25    2.69
Observations
• Both have means of zero and standard deviations just under 3
• However, each data point does not have simply one deviation from the mean: it deviates from both means
• Consider Points A, B, C and D on the next graph
Graph of Data
[Figure: the same scatter plot of CMP against TSE, with four points labelled A, B, C and D.]
Implications
• When points in the upper right and lower left quadrants dominate, then the sums of the products of the deviations will be positive
• When points in the lower right and upper left quadrants dominate, then the sums of the products of the deviations will be negative
An Important Observation
• The sums of the products of the deviations will give us the appropriate sign of the slope of our relationship
Covariance
(Showing the formula only to demonstrate a concept)
\[ \operatorname{COV}(X, Y) = \sigma_{XY} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N} \]
\[ \operatorname{cov}(X, Y) = s_{XY} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{\sum x_i y_i - \frac{\sum x_i \sum y_i}{n}}{n - 1} \]
Covariance
[Figure: the scatter plot of CMP against TSE with points A, B, C and D, repeated.]
• In the same units as Variance (if both variables are in the same unit), i.e. units squared
• Very important element of measuring portfolio risk in finance
Covariance in Excel
• Tools > Data Analysis > Covariance
            Column 1   Column 2
Column 1    7.25
Column 2    4.875      6.25
Interpreting the Result
            Column 1   Column 2
Column 1    7.25
Column 2    4.875      6.25
• This gives us the variances (7.25 & 6.25) and the covariance between the variables, 4.875
• In fact, variance is simply the covariance of a variable with itself!
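The three figures in this output can be reproduced with a short sketch. Two assumptions are made here: the flattened Example Data table is read row by row as (TSE, CMP) pairs, and the sample formula (dividing by n - 1) is used. Under those assumptions the results match the slide. The names `pairs`, `tse`, `cmp_returns` and `sample_cov` are hypothetical.

```python
# (TSE, CMP) monthly returns, read row by row from the Example Data slide (assumed pairing).
pairs = [
    (3, 4), (-4, -3), (2, 4), (-1, -2), (-1, 0), (-1, 1), (2, -2), (0, -2), (4, 3),
    (4, 2), (1, 0), (-2, -1), (5, 3), (0, 0), (1, 2), (-3, -5), (-3, 1), (-3, -4),
    (-5, -2), (-3, -2), (2, 1), (1, 2), (1, 3), (-2, -2), (2, -1),
]
tse = [x for x, _ in pairs]
cmp_returns = [y for _, y in pairs]

def sample_cov(xs, ys):
    """Sum of the products of the deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

print(sample_cov(tse, tse))                  # 7.25  (variance of column 1)
print(sample_cov(tse, cmp_returns))          # 4.875 (covariance)
print(sample_cov(cmp_returns, cmp_returns))  # 6.25  (variance of column 2)
```

Calling `sample_cov` with the same series twice illustrates the point that variance is the covariance of a variable with itself.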
Using Covariance
• Very useful in Finance for measuring portfolio risk
• Unfortunately, it is hard to interpret for two reasons:
  • What does the magnitude/size imply?
  • The units are confusing
A More Useful Statistic
• We can simultaneously adjust for both of these shortcomings by dividing the covariance by the two relevant standard deviations
• This operation
  • Removes the impact of size & scale
  • Eliminates the units
Correlation
• Correlation measures the sensitivity of one variable to another, but ignoring magnitude
• Range: -1 to 1
  • +1: Implies perfect positive co-movement
  • -1: Implies perfect negative co-movement
  • 0: No linear relationship
Calculating Correlation
\[ \rho_{XY} = \frac{\operatorname{COV}(X, Y)}{\sigma_X \sigma_Y} \]
\[ r_{XY} = \hat{\rho}_{XY} = \frac{\operatorname{cov}(X, Y)}{s_X s_Y} \]
Correlation in Excel
• Tools > Data Analysis > Correlation
            Column 1   Column 2
Column 1    1
Column 2    0.724212   1
Interpreting the Result
            Column 1   Column 2
Column 1    1
Column 2    0.724212   1
• The correlation of a variable with itself is 1
• The correlation between CMP and the TSE Index in this example is 0.724
  • This is positive, and relatively strong
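The correlation can be recovered directly from the covariance output shown earlier (variances 7.25 and 6.25, covariance 4.875) by dividing the covariance by the two standard deviations. The variable names below are illustrative.

```python
import math

# Figures from the Excel covariance output.
cov_xy = 4.875
var_x = 7.25
var_y = 6.25

# Correlation = covariance / (std dev of X * std dev of Y).
r = cov_xy / (math.sqrt(var_x) * math.sqrt(var_y))

print(round(r, 6))  # 0.724212
```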
Estimating Linear Relationships
• Often the data imply that a linear relationship exists
• We can estimate this relationship using the Least Squares Method of Regression
• We will just learn to use the Excel output and interpret it
TSE-CMP Regression Output (Abridged)
SUMMARY OUTPUT
Regression Statistics
Multiple R           0.724211819
R Square             0.524482759
Adjusted R Square    0.503808096
Standard Error       1.76102226
Observations         25

               Coefficients   Standard Error   t Stat     P-value
Intercept      0              0.352204452      0          1
X Variable 1   0.672413793    0.133502753      5.036704   4.26E-05
Interpreting the Output
• Multiple R (0.724211819) is the Correlation Coefficient
• The Intercept coefficient (0) is the Intercept
• The X Variable 1 coefficient (0.672413793) is the Slope
• The estimated relationship is: rCMP = 0 + 0.6724(rTSE) + e
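The slope, intercept and R Square in this output can be reproduced with a plain least squares sketch. It assumes the flattened Example Data table is read row by row as (TSE, CMP) pairs; the function `least_squares` and its return values are hypothetical names, and this is not how Excel computes its output internally.

```python
# (TSE, CMP) monthly returns, read row by row from the Example Data slide (assumed pairing).
pairs = [
    (3, 4), (-4, -3), (2, 4), (-1, -2), (-1, 0), (-1, 1), (2, -2), (0, -2), (4, 3),
    (4, 2), (1, 0), (-2, -1), (5, 3), (0, 0), (1, 2), (-3, -5), (-3, 1), (-3, -4),
    (-5, -2), (-3, -2), (2, 1), (1, 2), (1, 3), (-2, -2), (2, -1),
]

def least_squares(points):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

intercept, slope, r_squared = least_squares(pairs)
print(round(intercept, 4), round(slope, 4), round(r_squared, 4))  # 0.0 0.6724 0.5245
```

The slope equals the covariance divided by the variance of the explanatory variable (4.875 / 7.25), and R Square is the square of the correlation coefficient.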
Where We Are Going
• We will develop the use of the regression technique more fully
  • Multiple explanatory variables
  • Some Time-Series Applications