Topic 2:
Statistical Concepts and Market Returns
Descriptive Statistics
• The arithmetic mean is the sum of the observations divided by the
number of observations.
– The population mean is given by \mu = \frac{1}{N}\sum_{i=1}^{N} X_i
– The sample mean is the arithmetic average of the n sample
observations: \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i
• The median is the value of the middle item of a set of items that has
been sorted in ascending or descending order.
• The mode is the most frequently occurring value in a distribution.
• The weighted mean allows us to place greater importance on
different observations. For example, we may choose to give larger
companies greater weight in our computation of an index. In this
case, we would weight each observation based on its relative size.
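As a minimal sketch using only Python's standard library (the return figures and weights are hypothetical), the four location measures above can be computed as:

```python
from statistics import mean, median, mode

# Hypothetical annual returns (percent) for six stocks
returns = [12.0, 7.5, 7.5, -2.0, 10.0, 4.0]

arithmetic_mean = mean(returns)   # sum of observations / number of observations
mid_value = median(returns)       # middle item of the sorted data
most_frequent = mode(returns)     # most frequently occurring value

# Weighted mean: e.g., market-cap weights (summing to 1) emphasize larger companies
weights = [0.30, 0.20, 0.15, 0.10, 0.15, 0.10]
weighted_mean = sum(w * r for w, r in zip(weights, returns))
```

With equal weights of 1/n, the weighted mean reduces to the arithmetic mean.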
Descriptive Statistics
• The geometric mean is most frequently used to average rates of
change over time or to compute the growth rate of a variable.
G = (X_1 X_2 \cdots X_n)^{1/n}, with X_i \ge 0 for i = 1, 2, \ldots, n
– Geometric mean using natural logs:
\ln G = \frac{1}{n}\ln(X_1 X_2 \cdots X_n); once \ln G is computed, G = e^{\ln G}
– The geometric mean return allows us to compute the average
return when there is compounding:
1 + R_G = [(1 + R_1)(1 + R_2)(1 + R_3)\cdots(1 + R_T)]^{1/T}
R_G = \left[\prod_{t=1}^{T}(1 + R_t)\right]^{1/T} - 1
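A short sketch of the geometric mean return, with hypothetical yearly returns, showing that the direct product form and the log form agree:

```python
import math

# Hypothetical yearly returns: +10%, -5%, +20%
returns = [0.10, -0.05, 0.20]
T = len(returns)

# Compound the gross returns (1 + R_t), then take the T-th root
gross = math.prod(1 + r for r in returns)
geometric_mean_return = gross ** (1 / T) - 1

# Equivalent log form: ln(G) = (1/T) * sum of ln(1 + R_t), then G = e^{ln G}
log_version = math.exp(sum(math.log(1 + r) for r in returns) / T) - 1
```

Compounding at `geometric_mean_return` for T periods reproduces the total gross return exactly.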
Descriptive Statistics
• Quartiles divide the data into quarters.
• Quintiles divide the data into fifths.
• Deciles divide the data into tenths.
• Percentiles divide the data into hundredths.
• Variance measures the average squared deviation from the mean.
– Population variance: \sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}
• Population standard deviation: \sigma = \left[\frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}\right]^{1/2}
• Sample variance: s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}
• Sample standard deviation: s = \left[\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}\right]^{1/2}
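A minimal worked example (hypothetical data) making the divisor difference explicit: the population variance divides by N, the sample variance by n − 1:

```python
from math import sqrt

data = [2.0, 4.0, 6.0, 8.0]
n = len(data)
m = sum(data) / n                       # mean = 5.0

ss = sum((x - m) ** 2 for x in data)    # sum of squared deviations = 20.0
pop_var = ss / n                        # population variance divides by N
sample_var = ss / (n - 1)               # sample variance divides by n - 1
pop_sd, sample_sd = sqrt(pop_var), sqrt(sample_var)
```

The sample variance is always the larger of the two, which corrects the downward bias of the population formula when applied to a sample.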
Descriptive Statistics
• Because observations above the mean are often desirable, the
variance is not necessarily a good measure of risk. Semivariance
looks at the average squared deviation below the mean:
\frac{\sum_{X_i < \bar{X}}(X_i - \bar{X})^2}{n^* - 1}
where n^* is the number of observations below the mean.
• The coefficient of variation is the ratio of the standard deviation to
the mean value:
CV = \frac{s}{\bar{X}}
– a measure of relative dispersion
– can compare the dispersion of data with different scales
• Skewness measures the symmetry of a distribution.
– A symmetric distribution has a skewness of 0.
– Positive skewness indicates that the mean is greater than the
median (more than half the deviations from the mean are
negative).
– Negative skewness indicates that the mean is less than the
median (less than half the deviations from the mean are
negative).
Binomial Distribution
• Sometimes a random variable can only take on two values, success
or failure. This is referred to as a Bernoulli random variable.
• A Bernoulli trial is an experiment that produces only two outcomes.
• Y = 1 for success and Y = 0 for failure.
p(1)  P ( Y  1)  p
p(0)  P ( Y  0)  1  p
• A binomial random variable X is defined as the number of
successes in n Bernoulli trials.
X  Y1  Y2    Yn
• Binomial distribution assumes
– The probability, p, of success is constant for all trials
– The trials are independent
p(x) = P(X = x) = \binom{n}{x} p^x (1 - p)^{n - x} = \frac{n!}{(n - x)!\,x!}\, p^x (1 - p)^{n - x}
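The binomial probability function above can be sketched directly with the standard library (the trial count and success probability below are hypothetical):

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(X = x): probability of x successes in n independent Bernoulli trials."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

# Probability of exactly 3 up-moves in 5 periods, with p = 0.6 per period
prob = binomial_pmf(3, 5, 0.6)   # ≈ 0.3456
```

Summing the pmf over x = 0, …, n returns 1, as a distribution must.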
A Binomial Model of Stock Price
Movements
Normal Distribution
f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[\frac{-(x - \mu)^2}{2\sigma^2}\right] \quad \text{for } -\infty < x < +\infty
Two Normal Distributions
Units of Standard Deviation
Normal Distribution
• Approximately 50 percent of all observations fall in the interval μ ±
(2/3)σ.
• Approximately 68 percent of all observations fall in the interval μ ± σ.
• Approximately 95 percent of all observations fall in the interval μ ±
2σ.
• Approximately 99 percent of all observations fall in the interval μ ±
3σ.
• Standard normal distribution has a mean of zero and a standard
deviation of 1. We use Z to denote the standard normal random
variable:
Z = \frac{X - \mu}{\sigma}
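The interval approximations and the standardization formula can be checked with `statistics.NormalDist`; the observation values below are hypothetical:

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within(k: float) -> float:
    """Probability that a normal variable falls in the interval mu ± k·sigma."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

# Checks the approximations quoted above
half_band = within(2 / 3)   # ≈ 0.50
one_band = within(1)        # ≈ 0.68
two_band = within(2)        # ≈ 0.95

# Standardizing an observation: Z = (X - mu) / sigma
mu, sigma, x = 100.0, 15.0, 130.0
z = (x - mu) / sigma        # 2.0: x lies two standard deviations above the mean
```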
• The lognormal distribution is widely used for modeling the probability
distribution of asset prices.
Two Lognormal Distributions
Statistical Inference
• In statistics we are often interested in obtaining information
about the value of some parameter of a population.
• To obtain this information we usually take a smaller subset of the
population and try to draw some conclusions from this sample.
• Sampling distribution of a statistic is the distribution of all the distinct
possible values that the statistic can assume when computed from
samples of the same size randomly drawn from the same
population.
• Cross-sectional data represent observations over individual units at
a point in time, as opposed to time series data.
• Time series data is a set of observations on a variable’s outcomes in
different time periods.
• Investment analysts commonly work with both time-series and
cross-sectional data.
Central Limit Theorem
• The central limit theorem states that for
large sample sizes, for any underlying
distribution for a random variable, the
sampling distribution of the sample mean
for that variable will be approximately
normal, with mean equal to the population
mean for that random variable and
variance equal to the population variance
of the variable divided by sample size.
Standard Error of the Sample Mean
• For a sample mean calculated from a sample generated from a
population with standard deviation σ, the standard error of the
sample mean is
– When we know σ: \sigma_{\bar{X}} = \frac{\sigma}{\sqrt{n}}
– If the population standard deviation is unknown we have
s_{\bar{X}} = \frac{s}{\sqrt{n}}
– In practice, the population variance is almost always unknown.
To compute the sample standard deviation we use
s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}
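A short sketch of the estimated standard error, using a hypothetical sample and the sample standard deviation in place of the unknown σ:

```python
from math import sqrt
from statistics import stdev

# Hypothetical sample of monthly returns (percent)
sample = [1.2, -0.4, 0.8, 2.1, -1.0, 0.5, 1.7, 0.3]
n = len(sample)

s = stdev(sample)    # sample standard deviation (divides by n - 1)
se = s / sqrt(n)     # estimated standard error of the sample mean
```

The standard error shrinks with √n: quadrupling the sample size halves it.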
Point and Interval Estimates of the Population Mean
• An estimator is a formula for estimating a parameter. An estimate is
a particular value that we calculate from a sample by using an
estimator.
• An unbiased estimator is one whose expected value equals the
parameter it is intended to estimate.
• An unbiased estimator is efficient if no other unbiased estimator of
the same population parameter has a sampling distribution with
smaller variance.
• A consistent estimator is one for which the probability that the
estimate is close to the value of the population parameter increases
as sample size increases.
• A confidence interval is an interval for which we can assert with a
given probability 1 − α, called the degree of confidence, that it will
contain the parameter it is intended to estimate.
Confidence Intervals for the
Population Mean
• For a normally distributed population with known variance:
\bar{X} \pm z_{\alpha/2}\,\frac{\sigma}{\sqrt{n}}
• For a large sample, population variance unknown:
\bar{X} \pm z_{\alpha/2}\,\frac{s}{\sqrt{n}}
• Population variance unknown, t-distribution:
\bar{X} \pm t_{\alpha/2}\,\frac{s}{\sqrt{n}}
• The t-distribution is a symmetrical probability
distribution defined by a single parameter known
as degrees of freedom (df).
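The large-sample z-interval can be sketched as follows (the sample statistics are hypothetical; the critical value comes from the standard normal inverse cdf):

```python
from math import sqrt
from statistics import NormalDist

# 95% confidence interval for the mean: large sample, population variance unknown
xbar, s, n = 25.0, 4.0, 100
alpha = 0.05
z = NormalDist().inv_cdf(1 - alpha / 2)   # z_{alpha/2} ≈ 1.96
half_width = z * s / sqrt(n)
ci = (xbar - half_width, xbar + half_width)
```

For the t-based interval, only the critical value changes; with 99 df it is barely wider than the z-interval.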
Student’s t-Distribution versus the Standard Normal
Distribution
Selection of Sample Size
• All else equal, a larger sample size
decreases the width of the confidence
interval.
Standard error of the sample mean = \frac{\text{Sample standard deviation}}{\sqrt{\text{Sample size}}}
Bias in Sampling
• Sample selection bias is the error of distorting a statistical analysis
due to how the samples are collected.
• Look-ahead bias occurs when information that was not available on
the test date is used in the estimation.
• Time-period bias occurs when the test is based on a time period that
may make the results time-period specific.
• Survivorship bias occurs if companies are excluded from the
analysis because they have gone out of business or because of
reasons related to poor performance.
• Data mining bias – Data mining is the practice of determining a
model by extensive searching through a dataset for statistically
significant patterns.
• An out-of-sample test uses a sample that does not overlap the time
period(s) of the sample(s) on which a variable, strategy, or model
was developed.
Hypothesis Testing
• Often we are interested in testing the validity of some
statement.
– For example, Is the underlying mean return on this
mutual fund different from the underlying mean return
on its benchmark?
• Hypothesis testing is part of the branch of statistics
known as statistical inference.
• A hypothesis is a statement about one or more
populations.
Steps in Hypothesis Testing
1. Stating the hypotheses.
2. Identifying the appropriate test statistic
and its probability distribution.
3. Specifying the significance level.
4. Stating the decision rule.
5. Collecting the data and calculating the
test statistic.
6. Making the statistical decision.
Null vs. Alternative Hypothesis
• The null hypothesis is the hypothesis to be
tested.
• The alternative hypothesis is the
hypothesis accepted when the null
hypothesis is rejected.
Formulation of Hypotheses
1. H0: θ = θ0 versus HA: θ ≠ θ0
2. H0: θ ≤ θ0 versus HA: θ > θ0
3. H0: θ ≥ θ0 versus HA: θ < θ0
The first formulation is a two-sided test. The other two
are one-sided tests.
Test Statistic
• A test statistic is a quantity, calculated based on a sample, whose
value is the basis for deciding whether or not to reject the null
hypothesis.
Test statistic = \frac{\text{Sample statistic} - \text{Value of the population parameter under } H_0}{\text{Standard error of the sample statistic}}
• In reaching a statistical decision, we can make two possible errors:
– We may reject a true null hypothesis (a Type I error), or
– We may fail to reject a false null hypothesis (a Type II error).
• The level of significance of a test is the probability of a Type I error
that we accept in conducting a hypothesis test; it is denoted by α.
• The standard approach to hypothesis testing involves specifying a
level of significance (probability of Type I error) only.
• The power of a test is the probability of correctly rejecting the null
(rejecting the null when it is false).
• A rejection point (critical value) for a test statistic is a value with
which the computed test statistic is compared to decide whether to
reject or not reject the null hypothesis.
Test Statistic
• The p-value is the smallest level of significance at which the null
hypothesis can be rejected.
• The smaller the p-value, the stronger the evidence against the null
hypothesis and in favor of the alternative hypothesis.
Hypothesis Tests Concerning
the Mean
• Can test that the mean of a population is
equal to or differs from some hypothesized
value.
• Can test to see if the sample means from
two different populations differ.
Tests Concerning a Single Mean
• A t-test is usually used to test a hypothesis concerning the value of
a population mean.
• Use the t-test when the variance is unknown and the sample is
large, or when the sample is small but the population is normally
(or approximately normally) distributed:
t_{n-1} = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
where
t_{n-1} = t-statistic with n − 1 degrees of freedom
\bar{X} = sample mean
\mu_0 = the hypothesized value of the population mean
s = sample standard deviation
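A minimal sketch of the one-sample t-statistic, with hypothetical fund returns and a hypothesized mean of 5.0:

```python
from math import sqrt
from statistics import mean, stdev

# H0: mu = 5.0 (two-sided), hypothetical monthly fund returns in percent
sample = [6.2, 4.8, 5.9, 7.1, 5.4, 6.8, 4.5, 6.0]
mu0 = 5.0
n = len(sample)

t_stat = (mean(sample) - mu0) / (stdev(sample) / sqrt(n))
df = n - 1   # compare t_stat against a t-distribution with n - 1 df
```

The statistic is then compared with the rejection points of a t-distribution with n − 1 df at the chosen significance level.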
Tests Concerning a Single Mean
• If the population sampled is normally distributed with
known variance σ2, then the test statistic for a hypothesis
test concerning a single population mean, µ, is
z = \frac{\bar{X} - \mu_0}{\sigma/\sqrt{n}}
where
\mu_0 = the hypothesized value of the population mean
\sigma = known population standard deviation
Tests Concerning a Single Mean
• If the population sampled has unknown
variance and the sample is large, in place
of a t-test, an alternative statistic is
z = \frac{\bar{X} - \mu_0}{s/\sqrt{n}}
where
s = sample standard deviation
Rejection Points for a z-Test For
α = 0.10
1. H0: θ = θ0 versus Ha: θ ≠ θ0
Reject the null hypothesis if z > 1.645 or
if z < −1.645.
2. H0: θ ≤ θ0 versus Ha: θ > θ0
Reject the null hypothesis if z > 1.28.
3. H0: θ ≥ θ0 versus Ha: θ < θ0
Reject the null hypothesis if z < −1.28.
Rejection Points for a z-Test
For α = 0.05
1. H0: θ = θ0 versus Ha: θ ≠ θ0
Reject the null hypothesis if z > 1.96 or if
z < −1.96.
2. H0: θ ≤ θ0 versus Ha: θ > θ0
Reject the null hypothesis if z > 1.645.
3. H0: θ ≥ θ0 versus Ha: θ < θ0
Reject the null hypothesis if z < −1.645.
Rejection Points for a z-Test
For α = 0.01
1. H0: θ = θ0 versus Ha: θ ≠ θ0
Reject the null hypothesis if z > 2.575 or
if z < −2.575.
2. H0: θ ≤ θ0 versus Ha: θ > θ0
Reject the null hypothesis if z > 2.33.
3. H0: θ ≥ θ0 versus Ha: θ < θ0
Reject the null hypothesis if z < −2.33.
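The tabulated rejection points above can be reproduced from the standard normal inverse cdf; this small helper is a sketch, not part of the original material:

```python
from statistics import NormalDist

def z_rejection_point(alpha: float, two_sided: bool) -> float:
    """Critical z value: z_{alpha/2} for a two-sided test, z_alpha for one-sided."""
    tail = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail)

# e.g. z_rejection_point(0.10, True) ≈ 1.645, z_rejection_point(0.01, False) ≈ 2.33
```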
Rejection Points, 0.05 Significance Level, Two-Sided Test of the
Population Mean Using a z-Test
Rejection Point, 0.05 Significance Level,
One-Sided Test of the
Population Mean Using a z-Test
Tests Concerning the Differences
between Means
• Sometimes we are interested in testing
whether the mean value differs between
two groups.
• If reasonable to assume
– normally distributed
– samples are independent
• We can combine observations from both
samples to get a pooled estimate of the
unknown population variance.
Formulation of Hypotheses
1. H0: µ1 - µ2 = 0 versus HA: µ1 - µ2 ≠ 0
2. H0: µ1 - µ2 ≤ 0 versus HA: µ1 - µ2 > 0
3. H0: µ1 - µ2 ≥ 0 versus HA: µ1 - µ2 < 0
Test Statistic for a Test of Difference between 2 Population
Means
• Normally distributed populations, population variances unknown, but
assumed to be equal.
t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\left(\dfrac{s_p^2}{n_1} + \dfrac{s_p^2}{n_2}\right)^{1/2}}
– Pooled estimator of the common variance:
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}
– Degrees of freedom is n_1 + n_2 − 2
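A sketch of the pooled two-sample t-statistic (the fund returns are hypothetical):

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(sample1, sample2, hypothesized_diff=0.0):
    """Two-sample t-statistic assuming equal but unknown population variances.

    Returns the t-statistic and its degrees of freedom (n1 + n2 - 2).
    """
    n1, n2 = len(sample1), len(sample2)
    # Pooled estimator of the common variance
    sp2 = ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    se = sqrt(sp2 / n1 + sp2 / n2)
    t = (mean(sample1) - mean(sample2) - hypothesized_diff) / se
    return t, n1 + n2 - 2

# Hypothetical returns for two funds
fund_a = [5.2, 6.1, 4.8, 5.9, 6.3]
fund_b = [4.1, 4.9, 5.0, 4.4, 4.6]
t_stat, df = pooled_t(fund_a, fund_b)
```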
Test Statistic for a Test of Difference between 2 Population
Means
• Normally distributed populations, population variances unequal and
unknown.
t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^{1/2}}
– Degrees of freedom is given by
df = \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1} + \dfrac{(s_2^2/n_2)^2}{n_2}}
Mean Differences – Populations
Not Independent
• If the samples are not independent, a
test of mean difference is done using
paired observations.
1. H0: µd = µd0 versus HA: µd ≠ µd0
2. H0: µd ≤ µd0 versus HA: µd > µd0
3. H0: µd ≥ µd0 versus HA: µd < µd0
Mean Differences – Populations Not Independent
• To calculate the t-statistic, we first need to find the sample mean
difference:
\bar{d} = \frac{1}{n}\sum_{i=1}^{n} d_i
• The sample variance is
s_d^2 = \frac{\sum_{i=1}^{n}(d_i - \bar{d})^2}{n - 1}
• The standard error of the mean difference is
s_{\bar{d}} = \frac{s_d}{\sqrt{n}}
• The test statistic, with n − 1 df, is
t = \frac{\bar{d} - \mu_{d0}}{s_{\bar{d}}}
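The paired test works on the differences directly, as this sketch with hypothetical before/after returns shows:

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical paired observations, e.g. fund returns before/after a strategy change
before = [5.1, 4.8, 6.0, 5.5, 4.9]
after = [5.6, 5.0, 6.4, 5.9, 5.3]

diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
d_bar = mean(diffs)
se_d = stdev(diffs) / sqrt(n)    # standard error of the mean difference
t_stat = (d_bar - 0.0) / se_d    # H0: mu_d = 0, with n - 1 df
```

Because the pairing removes the between-unit variation, the standard error here is typically much smaller than in the independent-samples test.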
Hypothesis Tests Concerning Variance
• We examine two types:
– tests concerning the value of a single population
variance and
– tests concerning the differences between two
population variances.
• We can formulate hypotheses as follows:
1. H_0: \sigma^2 = \sigma_0^2 versus H_a: \sigma^2 \ne \sigma_0^2
2. H_0: \sigma^2 \le \sigma_0^2 versus H_a: \sigma^2 > \sigma_0^2
3. H_0: \sigma^2 \ge \sigma_0^2 versus H_a: \sigma^2 < \sigma_0^2
Tests Concerning the Value of a
Population Variance (Normal Dist)
\chi^2 = \frac{(n - 1)s^2}{\sigma_0^2}, with n − 1 df
• where
s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n - 1}
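The chi-square statistic for a single variance is a one-liner; the sample and hypothesized variance below are hypothetical:

```python
from statistics import variance

# H0: sigma^2 = 4.0 for a normally distributed population (hypothetical data)
sample = [10.2, 8.9, 11.5, 9.8, 10.7, 12.1, 9.3, 10.4]
sigma0_sq = 4.0
n = len(sample)

chi2 = (n - 1) * variance(sample) / sigma0_sq   # compare to chi-square with n - 1 df
df = n - 1
```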
Tests Concerning the Equality of Two Variances
• We can formulate hypotheses as follows:
1. H_0: \sigma_1^2 = \sigma_2^2 versus H_a: \sigma_1^2 \ne \sigma_2^2
2. H_0: \sigma_1^2 \le \sigma_2^2 versus H_a: \sigma_1^2 > \sigma_2^2
3. H_0: \sigma_1^2 \ge \sigma_2^2 versus H_a: \sigma_1^2 < \sigma_2^2
• Suppose we have two samples, the first with n1 observations and the
second with n2 observations:
F = \frac{s_1^2}{s_2^2}, with (n_1 - 1) and (n_2 - 1) degrees of freedom
Nonparametric Inference
• A nonparametric test either is not concerned with a
parameter or makes minimal assumptions about
the population being sampled.
• A nonparametric test is primarily used in three
situations: when data do not meet distributional
assumptions, when data are given in ranks, or
when the hypothesis we are addressing does
not concern a parameter.
The Spearman Rank Correlation
Coefficient
• The Spearman rank correlation coefficient
is calculated on the ranks of two variables
within their respective samples.
r_S = 1 - \frac{6\sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}
where d_i is the difference between the ranks of the i-th pair of observations.
• For large samples, significance can be tested with the statistic
t = \frac{(n - 2)^{1/2}\, r_S}{(1 - r_S^2)^{1/2}}
which has n − 2 degrees of freedom.
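A sketch of the rank-difference formula (assuming no tied values; the paired observations are hypothetical):

```python
from math import sqrt

def ranks(values):
    # Rank 1 = smallest value; assumes no ties within the sample
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for position, idx in enumerate(order, start=1):
        r[idx] = position
    return r

def spearman(x, y):
    """Spearman rank correlation from the squared rank differences d_i^2."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical paired observations
x = [3.1, 1.2, 5.4, 2.2, 4.0, 6.3]
y = [2.5, 1.0, 4.8, 3.0, 3.9, 6.1]
rs = spearman(x, y)
n = len(x)
t_stat = sqrt(n - 2) * rs / sqrt(1 - rs ** 2)   # n - 2 degrees of freedom
```

Perfectly concordant rankings give r_S = 1 and perfectly discordant rankings give r_S = −1, regardless of the raw values.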