September Statistics for MSc
Weeks 1 - 2
Probability and Distribution Theories
Ali C. Tasiran
Department of Economics, Mathematics and Statistics
Malet Street, London WC1E 7HX
September 2014
Contents

1 Introduction
  1.1 Textbooks
  1.2 Some preliminaries
  Problems

2 Probability
  2.1 Probability definitions and concepts
    2.1.1 Classical definition of probability
    2.1.2 Frequency definition of probability
    2.1.3 Subjective definition of probability
    2.1.4 Axiomatic definition of probability
  Problems

3 Random variables and probability distributions
  3.1 Random variables, densities, and cumulative distribution functions
    3.1.1 Discrete Distributions
    3.1.2 Continuous Distributions
    3.1.3 Example
  3.2 Problems

4 Expectations and moments
  4.1 Mathematical Expectation and Moments
    4.1.1 Mathematical Expectation
    4.1.2 Moments
  Problems
Chapter 1
Introduction
1.1 Textbooks
Lecture notes are provided. However, these are not a substitute for a textbook. I do not
recommend any particular text, but in the past students have found the following useful.
• Greene, W.H., (2004) Econometric Analysis, 5th edition, Prentice-Hall. A good summary of much of the material can be found in the appendices.
• Hogg, R.V. and Craig, A.T., (1995) Introduction to Mathematical Statistics, 5th edition, Prentice Hall. A popular textbook, even though it is slightly dated.
• Mittelhammer, R.C., (1999) Mathematical Statistics for Economics and Business, Springer Verlag. A good mathematical statistics textbook for economists, especially useful for further econometric studies.
• Mood, A.M., Graybill, F.A., and Boes, D.C., (1974) Introduction to the Theory of Statistics, 3rd edition, McGraw-Hill.
• Spanos, A., (1999) Probability Theory and Statistical Inference: Econometric Modeling with Observational Data, Cambridge University Press.
• Wackerly, D., Mendenhall, W., and Scheaffer, R., (1996) Mathematical Statistics with Applications, 5th edition, Duxbury Press.
Those who plan to take forthcoming courses in Econometrics may wish to buy the book by Greene (2004).
Welcome to this course.
Ali Tasiran
[email protected]
1.2 Some preliminaries
Statistics is the science of observing data and making inferences about the characteristics of the random mechanism that has generated the data. It is also called the science of uncertainty.
In Economics, theoretical models are used to analyze economic behavior. Economic theoretical models are deterministic functions, but in the real world the relationships are not exact and deterministic; rather, they are uncertain and stochastic. We thus employ distribution functions to make approximations to the actual processes that generate the observed data. The process that generates the data is known as the data generating process (DGP, or Super Population). In Econometrics, to study economic relationships, we estimate statistical models, which are built under the guidance of theoretical economic models and by taking into account the properties of the data generating process.
Using the parameters of estimated statistical models, we make generalisations about the characteristics of the random mechanism that has generated the data. In Econometrics, we use the observed data in samples to draw conclusions about populations. Populations are either real, from which the data came, or conceptual, i.e., processes by which the data were generated. Inference in the first case is called design-based (for experimental data) and is used mainly to study samples from populations with known frames. Inference in the second case is called model-based (for observational data) and is used mainly to study stochastic relationships.
The statistical theory used for such analyses is called Classical inference, and it is the approach followed in this course. It is based on two premises:
1. The sample data constitute the only relevant information.
2. The construction and assessment of the different procedures for inference are based on long-run behavior under similar circumstances.
The starting point of an investigation is an experiment. An experiment is a random
experiment if it satisfies the following conditions:
- all possible distinct outcomes are known ahead of time
- the outcome of a particular trial is not known a priori
- the experiment can be duplicated.
The totality of all possible outcomes of the experiment is referred to as the sample
space (denoted by S) and its distinct individual elements are called the sample points or
elementary events. An event, is a subset of a sample space and is a set of sample points
that represents several possible outcomes of an experiment.
A sample space with a finite or countably infinite number of sample points (i.e., in one-to-one correspondence with the positive integers) is called a discrete space.
A continuous space is one with an uncountably infinite number of sample points (that is, it has as many elements as there are real numbers).
Events are generally represented by sets, and some important concepts can be explained
by using the algebra of sets (known as Boolean Algebra).
Definition 1 The sample space is denoted by S. A = S implies that the events in A must always occur. The empty set is a set with no elements and is denoted by ∅. A = ∅ implies that the events in A do not occur.
The set of all elements not in A is called the complement of A and is denoted by Ā.
Thus, Ā occurs if and only if A does not occur.
The set of all points in either a set A or a set B or both is called the union of the two
sets and is denoted by ∪. A ∪ B means that either the event A or the event B or both
occur. Note: A ∪ Ā = S.
September Statistics
4
The set of all elements in both A and B is called the intersection of the two sets and is represented by ∩. A ∩ B means that both the events A and B occur simultaneously. A ∩ B = ∅ means that A and B cannot occur together; A and B are then said to be disjoint or mutually exclusive. Note: A ∩ Ā = ∅.
A ⊂ B means that A is contained in B or that A is a subset of B, that is, every element of A is an element of B. In other words, if an event A has occurred, then B must have occurred also.
Sometimes it is useful to divide the elements of a set A into several subsets that are disjoint. Such a division is known as a partition. If A1 and A2 are such partitions, then A1 ∩ A2 = ∅ and A1 ∪ A2 = A. This can be generalized to n partitions: A = ∪_{i=1}^{n} Ai with Ai ∩ Aj = ∅ for i ≠ j.
Some postulates of Boolean Algebra:
Identity: There exist unique sets ∅ and S such that, for every set A, A ∩ S = A and A ∪ ∅ = A.
Complementation: For each A we can define a unique set Ā such that A ∩ Ā = ∅ and A ∪ Ā = S.
Closure: For every pair of sets A and B, we can define unique sets A ∪ B and A ∩ B.
Commutative: A ∪ B = B ∪ A; A ∩ B = B ∩ A.
Associative: (A ∪ B) ∪ C = A ∪ (B ∪ C). Also (A ∩ B) ∩ C = A ∩ (B ∩ C).
Distributive: A ∩ (B ∪ C) = (A ∩ B) ∪ (A ∩ C). Also, A ∪ (B ∩ C) = (A ∪ B) ∩ (A ∪ C).
De Morgan's Laws: the complement of A ∪ B equals Ā ∩ B̄, and the complement of A ∩ B equals Ā ∪ B̄.
Problems
1. Let the set S contain the ordered combinations of sexes of two children:
S = {FF, FM, MF, MM}.
Let A denote the subset of possibilities containing no males, B the subset of two
males, and C the subset containing at least one male. List the elements of A, B, C,
A ∩ B, A ∪ B, A ∩ C, A ∪ C, B ∩ C, B ∪ C, and C ∩ B̄.
2. Verify De Morgan's Laws by drawing Venn diagrams: the complement of A ∪ B equals Ā ∩ B̄, and the complement of A ∩ B equals Ā ∪ B̄.
Chapter 2
Probability
2.1 Probability definitions and concepts
2.1.1 Classical definition of probability
If an experiment has n(n < ∞) mutually exclusive and equally likely outcomes, and if nA
of these outcomes have an attribute A (that is, the event A occurs in nA possible ways),
then the probability of A is nA /n, denoted as P (A) = nA /n
2.1.2 Frequency definition of probability
Let nA be the number of times the event A occurs in n trials of an experiment. If there
exists a real number p such that p = limn→∞ (nA /n), then p is called the probability of A
and is denoted as P (A). (Examples are histograms for frequency distribution of variables).
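As a rough illustration of this limiting relative frequency (not part of the original notes), the Python sketch below simulates a fair die and tracks nA/n for the event A = "a six is thrown"; the die and the trial counts are my own illustrative choices.

```python
# Frequency definition sketch: the relative frequency n_A/n of the event
# A = "a six is thrown" should settle down near 1/6 as n grows.
import random

random.seed(1)
n_A = 0
checkpoints = {10, 100, 1_000, 10_000, 100_000}
for n in range(1, 100_001):
    if random.randint(1, 6) == 6:   # one trial of the experiment
        n_A += 1
    if n in checkpoints:
        print(f"n = {n:>7}: n_A/n = {n_A / n:.4f}   (1/6 = {1/6:.4f})")
```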
2.1.3 Subjective definition of probability
Subjective probabilities are our personal judgments of the relative likelihood of various outcomes. They are based on our "educated guesses" or intuitions, for example: "The weather will be rainy tomorrow with probability 0.6."
2.1.4 Axiomatic definition of probability
The probability of an event A ∈ z is a real number such that
1) P(A) ≥ 0 for every A ∈ z,
2) the probability of the entire sample space S is 1, that is, P(S) = 1, and
3) if A1, A2, ..., An are mutually exclusive events (that is, Ai ∩ Aj = ∅ for all i ≠ j), then P(A1 ∪ A2 ∪ ... ∪ An) = Σ_i P(Ai), and this holds for n = ∞ also.
Here z is a collection of subsets of the sample space S. The triple (S, z, P(·)) is referred to as the probability space, and P(·) is a probability measure.
We can derive the following theorems by using the axiomatic Definition of probability.
Theorem 2 P (Ā) = 1 − P (A).
Theorem 3 P (A) ≤ 1.
Theorem 4 P(∅) = 0.
Theorem 5 If A ⊂ B, then P (A) ≤ P (B).
Theorem 6 P (A ∪ B) = P (A) + P (B) − P (A ∩ B).
Definition 7 Let A and B be two events in a probability space (S, z, P(·)) such that P(B) > 0. The conditional probability of A given that B has occurred, denoted by P(A | B), is given by P(A ∩ B)/P(B). (It should be noted that the original probability space (S, z, P(·)) remains unchanged even though we focus our attention on the subspace; the latter is (S, z, P(· | B)).)
Theorem 8 Bonferroni’s Theorem: Let A and B be two events in a sample space S. Then
P (A ∩ B) ≥ 1 − P (Ā) − P (B̄).
Theorem 9 Bayes Theorem: If A and B are two events with positive probabilities, then
   P(A | B) = P(A) P(B | A) / P(B)
Law of total probability
Assume that S = A1 ∪ A2 ∪ ... ∪ An where Ai ∩ Aj = ∅ for i ≠ j. Then for any event B ⊂ S,
   P(B) = Σ_{i=1}^{n} P(Ai) P(B | Ai).
Theorem 10 Extended Bayes Theorem: If A1, A2, ..., An constitute a partition of the sample space, so that Ai ∩ Aj = ∅ for i ≠ j and ∪i Ai = S, and P(Ai) ≠ 0 for any i, then for a given event B with P(B) > 0,
   P(Ai | B) = P(Ai) P(B | Ai) / Σ_j P(Aj) P(B | Aj)
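The following hedged numerical sketch (my own example, with made-up probabilities) applies the law of total probability and the extended Bayes theorem to a three-event partition.

```python
# Law of total probability and extended Bayes theorem for a partition A1, A2, A3.
# All probabilities below are illustrative values, not taken from the notes.
priors = {"A1": 0.5, "A2": 0.3, "A3": 0.2}          # P(Ai), a partition of S
likelihoods = {"A1": 0.10, "A2": 0.40, "A3": 0.70}  # P(B | Ai)

# Law of total probability: P(B) = sum_i P(Ai) P(B | Ai)
p_B = sum(priors[a] * likelihoods[a] for a in priors)

# Extended Bayes theorem: P(Ai | B) = P(Ai) P(B | Ai) / P(B)
posteriors = {a: priors[a] * likelihoods[a] / p_B for a in priors}

print(f"P(B) = {p_B:.3f}")
for a, p in posteriors.items():
    print(f"P({a} | B) = {p:.3f}")
print("posterior probabilities sum to", round(sum(posteriors.values()), 10))
```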
Definition 11 Two events A and B with positive probabilities are said to be statistically
independent if and only if P (A | B) = P (A). Equivalently, P (B | A) = P (B) and
P (A ∩ B) = P (A)P (B).
The other type of statistical inference is called Bayesian inference, in which sample information is combined with prior information. The prior information is expressed as a probability distribution known as the prior distribution. When it is combined with the sample information, a posterior distribution of the parameters is obtained. It can be derived by using Bayes Theorem.
If we substitute Model (the model that generated the observed data) for A and Data
(Observed Data) for B, then we have
   P(Model | Data) = P(Data | Model) P(Model) / P(Data)    (2.1)
where P(Data | Model) is the probability of observing the data given that the model is true. This is usually called the likelihood (sample information). P(Model) is the probability that the model is true before observing the data (usually called the prior probability). P(Model | Data) is the probability that the model is true after observing the data (usually called the posterior probability). P(Data) is the unconditional probability of observing the data (whether the model is true or not). Hence, the relation can be written
   P(Model | Data) ∝ P(Data | Model) P(Model)    (2.2)
That is, the posterior probability is proportional to the likelihood (sample information) times the prior probability. The inverse of an estimator's variance is called the precision. In Classical inference we work only with the parameters' variances, but in Bayesian inference we have both a sample precision and a prior precision. The precision (inverse of the variance) of the posterior distribution of a parameter is the sum of the sample precision and the prior precision. As a result, the posterior mean lies between the sample mean and the prior mean, and the posterior variance is smaller than both the sample and prior variances. These are among the reasons behind the increasing popularity of Bayesian inference in practical econometric applications.
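As a minimal sketch of this precision arithmetic, assuming the textbook normal prior / normal likelihood (known variance) case, the snippet below combines a prior mean with a sample mean; all numerical values are illustrative assumptions, not taken from the notes.

```python
# Normal-normal sketch: posterior precision = prior precision + sample precision,
# and the posterior mean is the precision-weighted average of the two means.
prior_mean, prior_var = 2.0, 4.0             # prior distribution of the mean
sample_mean, sample_var_of_mean = 5.0, 1.0   # xbar and Var(xbar) = sigma^2 / n

prior_prec = 1 / prior_var
sample_prec = 1 / sample_var_of_mean

post_prec = prior_prec + sample_prec         # precisions add
post_var = 1 / post_prec
post_mean = (prior_prec * prior_mean + sample_prec * sample_mean) / post_prec

print(f"posterior mean     = {post_mean:.3f}  (between {prior_mean} and {sample_mean})")
print(f"posterior variance = {post_var:.3f}  (smaller than {prior_var} and {sample_var_of_mean})")
```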
When we speak in econometrics of models to be estimated or tested, we refer, in the Classical inference context, to sets of DGPs. In design-based inference, we restrict our attention to a particular sample size and characterize a DGP by the law of probability that governs the random variables in a sample of that size. In model-based inference, where we refer to a limiting process in which the sample size goes to infinity, such a restricted characterization will no longer suffice. When we indulge in asymptotic theory, the DGPs in question must be stochastic processes. A stochastic process is a collection of random variables indexed by some suitable index set. This index set may be finite, in which case we have no more than a vector of random variables, or it may be infinite, with either a discrete or a continuous infinity of elements. In order to define a DGP, we must be able to specify the joint distribution of the set of random variables corresponding to the observations contained in a sample of arbitrarily large size. This is a very strong requirement. In econometrics, or any other empirical discipline for that matter, we deal with finite samples. How then can we, even theoretically, treat infinite samples? We must in some way create a rule that allows one to generalize from finite samples to an infinite stochastic process. Unfortunately, for any observational framework, there is an infinite number of ways in which such a rule can be constructed, and different rules can lead to widely different asymptotic conclusions. In estimating an econometric model, we are trying to obtain an estimated characterization of the DGP that actually did generate the data. Let us denote an econometric model that is to be estimated, tested, or both, as M and a typical DGP belonging to M as µ.
The simplest model in econometrics is the linear regression model; one possibility is to write
   y = Xβ + u,   u ∼ N(0, σ²In)    (2.3)
where y and u are n-vectors and X is a nonrandom n×k matrix, so that y follows the N(Xβ, σ²In) distribution. This distribution is unique once the parameters β and σ² are specified. We may therefore say that the DGP is completely characterized by the model parameters. In other words, knowledge of the model parameters β and σ² uniquely identifies an element µ of M.
On the other hand, the linear regression model can also be written as
   y = Xβ + u,   u ∼ IID(0, σ²In)    (2.4)
with no assumption of normality. Many aspects of the theory of linear regression still apply: the OLS estimator is unbiased, and its covariance matrix is σ²(X′X)⁻¹. But the distribution of the vector u, and hence also that of y, is now only partially characterized even when β and σ² are known. For example, the errors u could be skewed to the left or to the right, and could have fourth moments larger or smaller than 3σ⁴. Let us call the sets of DGPs associated with these two regressions M1 and M2, respectively; M1 is in fact a proper subset of M2. For a given β and σ² there is an infinite number of DGPs in M2 (only one of which is in M1) that all correspond to the same β and σ². Thus we must consider these as different models even though the parameters used in them are the same. In either case, it must be possible to associate a parameter vector in a unique way with any DGP µ in the model M, even if the same parameter vector is associated with many DGPs. We call the model M together with its associated parameter-defining mapping θ a parametrized model. The main task in our practical work is to build the association between the DGPs of a model and the model parameters. For example, in the Generalized Method of Moments (GMM) context, there are many possible ways of choosing the econometric model, i.e., the underlying set of DGPs. One of the advantages of GMM as an estimation method is that it permits models which consist of a very large number of DGPs. In striking contrast to Maximum Likelihood estimation, where the model must be completely specified, any DGP is admissible if it satisfies a relatively small number of restrictions or regularity conditions. Sometimes, the existence of the moments used to define the parameters is the only requirement needed for a model to be well defined.
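To make the idea of a DGP concrete, the following small simulation (an illustration of mine, not part of the notes) draws repeated samples from the normal-error DGP in (2.3) and checks by Monte Carlo that the OLS estimator is approximately unbiased; the design matrix, β and σ are arbitrary choices.

```python
# Simulate the DGP y = X beta + u with u ~ N(0, sigma^2 I) and verify by
# Monte Carlo that the OLS estimator is (approximately) unbiased.
import numpy as np

rng = np.random.default_rng(0)
n, k = 100, 2
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=n)])  # fixed regressors
beta = np.array([1.0, 0.5])
sigma = 2.0

estimates = []
for _ in range(2000):                       # 2000 samples from the same DGP
    u = rng.normal(0.0, sigma, size=n)
    y = X @ beta + u
    b_ols = np.linalg.solve(X.T @ X, X.T @ y)
    estimates.append(b_ols)

print("true beta:            ", beta)
print("mean of OLS estimates:", np.mean(estimates, axis=0).round(3))
```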
Problems
1. A sample space consists of five simple events E1 , E2 , E3 , E4 , and E5 .
(a) If P (E1 ) = P (E2 ) = 0.15, P (E3 ) = 0.4 and P (E4 ) = 2P (E5 ), find the probabilities of E4 and E5 .
(b) If P (E1 ) = 3P (E2 ) = 0.3, find the remaining simple events if you know that the
remaining events are equally probable.
2. A business office orders paper supplies from one of three vendors, V1 , V2 , and V3 .
Orders are to be placed on two successive days, one order per day. Thus (V2 , V3 )
might denote that vendor V2 gets the order on the first day and vendor V3 gets the
order on the second day.
(a) List the sample points in this experiment of ordering paper on two successive
days.
(b) Assume the vendors are selected at random each day and assign a probability to
each sample point.
(c) Let A denote the event that the same vendor gets both orders and B the event
that V2 gets at least one order. Find P (A), P (B), P (A ∩ B), and P (A ∪ B) by
summing probabilities of the sample points in these events.
Chapter 3
Random variables and probability distributions
3.1 Random variables, densities, and cumulative distribution functions
A random variable X, is a function whose domain is the sample space and whose range is
a set of real numbers.
Definition 12 In simple terms, a random variable (also referred to as a stochastic variable) is a real-valued set function whose value is a real number determined by the outcome of an experiment. The range of a random variable is the set of all the values it can assume. The particular values observed are called realisations x. If these are countable, x1, x2, ..., the random variable is said to be discrete, with associated probabilities
   P(X = xi) = p(xi) ≥ 0,   Σ_i p(xi) = 1;    (3.1)
and cumulative distribution P(X ≤ xj) = Σ_{i=1}^{j} p(xi).
For a continuous random variable, defined over the real line, the cumulative distribution function is
   F(x) = P(X ≤ x) = ∫_{−∞}^{x} f(u) du,    (3.2)
where f(x) denotes the probability density function,
   f(x) = dF(x)/dx,    (3.3)
and ∫_{−∞}^{∞} f(x) dx = 1.
Also note that the cumulative distribution function satisfies lim_{x→∞} F(x) = 1 and lim_{x→−∞} F(x) = 0.
Definition 13 The real-valued function F(x) such that F(x) = P_X{(−∞, x]} for each x ∈ ℝ is called the distribution function, also known as the cumulative distribution (or cumulative density) function, or CDF.
Theorem 14 P (a ≤ X ≤ b) = F (b) − F (a)
Theorem 15 For each x ∈ ℝ, F(x) is continuous to the right of x.
Theorem 16 If F(x) is continuous at x ∈ ℝ, then P(X = x) = 0.
Although f(x) is defined at a point, P(X = x) = 0 for a continuous random variable.
The support of a distribution is the range over which f(x) ≠ 0.
Let f be a function from ℝ^k to ℝ. Let x0 be a vector in ℝ^k and let y = f(x0) be its image. The function f is continuous at x0 if, whenever {xn}_{n=1}^{∞} is a sequence in ℝ^k which converges to x0, the sequence {f(xn)}_{n=1}^{∞} converges to f(x0). The function f is said to be continuous if it is continuous at each point in its domain.
All polynomial functions are continuous. As an example of a function that is not continuous, consider
   f(x) = 1 if x > 0, and f(x) = 0 if x ≤ 0.
If both g and f are continuous functions, then g(f(x)) is continuous.
3.1.1 Discrete Distributions
Definition 17 For a discrete random variable X, let f(x) = P(X = x). The function f(x) is called the probability function (or probability mass function).
The Bernoulli Distribution
   f(x; θ) = f(x; p) = p^x (1 − p)^{1−x} for x = 0, 1 (failure, success) and 0 ≤ p ≤ 1.
The Binomial Distribution
   f(x; θ) = B(x; n, p) = (n choose x) p^x (1 − p)^{n−x} = [n!/(x!(n − x)!)] p^x (1 − p)^{n−x}    (3.4)
for x = 0, 1, ..., n (X is the number of successes in n trials) and 0 ≤ p ≤ 1.
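A quick sketch of equation (3.4) in code, using only the Python standard library; the values of n and p are illustrative choices of mine.

```python
# Binomial probability function written directly from (3.4).
from math import comb

def binom_pmf(x: int, n: int, p: float) -> float:
    """P(X = x) for X ~ B(n, p), following equation (3.4)."""
    return comb(n, x) * p**x * (1 - p)**(n - x)

n, p = 10, 0.3
pmf = [binom_pmf(x, n, p) for x in range(n + 1)]
print("P(X = 3) =", round(pmf[3], 4))
print("sum over x = 0..n:", round(sum(pmf), 10))                     # should be 1
print("sum of x*p(x)    :", round(sum(x * pmf[x] for x in range(n + 1)), 4),
      " (compare with np =", n * p, ")")
```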
3.1.2 Continuous Distributions
Definition 18 For a random variable X, if there exists a nonnegative function f(x), defined on the real line, such that for any interval B,
   P(X ∈ B) = ∫_B f(x) dx,    (3.5)
then X is said to have a continuous distribution and the function f(x) is called the probability density function or simply density function (or pdf).
The following can be written for continuous random variables:
   F(x) = ∫_{−∞}^{x} f(u) du    (3.6)
   f(x) = F′(x) = dF(x)/dx    (3.7)
   ∫_{−∞}^{+∞} f(u) du = 1    (3.8)
   F(b) − F(a) = ∫_{a}^{b} f(u) du    (3.9)
Uniform Distribution on an Interval
A random variable X with the density function
   f(x; a, b) = 1/(b − a)    (3.10)
on the interval a ≤ x ≤ b is said to have the uniform distribution on that interval.
The Normal Distribution
A random variable X with the density function
   f(x; µ, σ) = (1/(σ√(2π))) exp(−(x − µ)²/(2σ²))    (3.11)
is called a Normal (Gaussian) distributed variable.
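The snippet below evaluates the density (3.11) directly from the formula and checks numerically that it integrates to 1, as (3.8) requires; µ, σ and the crude trapezoidal grid are my own illustrative choices.

```python
# Evaluate the normal density (3.11) and check numerically that it integrates to 1.
from math import exp, pi, sqrt

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

mu, sigma = 1.0, 2.0
# crude trapezoidal integration over a wide interval around the mean
a, b, steps = mu - 10 * sigma, mu + 10 * sigma, 100_000
h = (b - a) / steps
total = sum(normal_pdf(a + i * h, mu, sigma) for i in range(steps + 1)) * h
total -= 0.5 * h * (normal_pdf(a, mu, sigma) + normal_pdf(b, mu, sigma))
print("integral of f over a wide interval ≈", round(total, 6))   # ≈ 1
print("density at the mean f(mu) =", round(normal_pdf(mu, mu, sigma), 6))
```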
3.1.3 Example
1. Toss of a single fair coin. X = number of heads.
   F(x) = 0 if x < 0;  F(x) = 1/2 if 0 ≤ x < 1;  F(x) = 1 if x ≥ 1.
The cumulative distribution function (cdf) of a discrete random variable is always a step function, because the cdf increases only at a countable number of points.
   f(x) = 1/2 if x = 0;  f(x) = 1/2 if x = 1.
   F(x) = Σ_{xj ≤ x} f(xj)
3.2 Problems
1. Write P (a ≤ x ≤ b) in terms of integrals and draw a picture for it.
2. Assume the probability density function for x is:
   f(x) = cx if 0 ≤ x ≤ 2, and 0 elsewhere.
(a) Find the value of c for which f (x) is a pdf.
(b) Compute F (x).
(c) Compute P (1 ≤ x ≤ 2).
3. A large lot of electrical fuses is supposed to contain only 5 percent defectives. Assuming a binomial model, if n = 20 fuses are randomly sampled from this lot, find the probability that at least three defectives will be observed.
4. Let the distribution function of a random variable X be given by
   F(x) = 0 for x < 0;  F(x) = x/8 for 0 ≤ x < 2;  F(x) = x²/16 for 2 ≤ x < 4;  F(x) = 1 for x ≥ 4.
(a) Find the density function (i.e., pdf) of x.
(b) Find P (1 ≤ x ≤ 3)
(c) Find P (x ≤ 3)
(d) Find P (x ≥ 1 | x ≤ 3).
Chapter 4
Expectations and moments
4.1 Mathematical Expectation and Moments
The probability density and cumulative distribution functions determine the probabilities of random variables at various points or in different intervals. Very often we are interested in summary measures of where the distribution is located, how it is dispersed around some average measure, whether it is symmetric around some point, and so on.
4.1.1 Mathematical Expectation
Definition 19 Let X be a random variable with f(x) as its PMF or PDF, and let g(X) be a single-valued function. The following is the expected value (or mathematical expectation) of g(X), denoted by E[g(X)]. In the case of a discrete random variable it takes the form E[g(X)] = Σ_i g(xi) f(xi), and in the continuous case, E[g(X)] = ∫_{−∞}^{+∞} g(x) f(x) dx.
Mean of a Distribution
For the special case of g(X) = X, the mean of a distribution is µ = E(X).
Theorem 20 If c is a constant, E(c) = c.
Theorem 21 If c is constant, E[cg(X)] = cE[g(X)].
Theorem 22 E[u(X) + v(X)] = E[u(X)] + E[v(X)].
Theorem 23 E(X − µ) = 0, where µ = E(X).
Examples:
Ex1: Let X have the probability density function
   x:     1     2     3     4
   f(x):  4/10  1/10  3/10  2/10
E(X) = Σ_x x f(x) = 1(4/10) + 2(1/10) + 3(3/10) + 4(2/10) = 23/10.
Ex2: Let X have the pdf
   f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.
E(X) = ∫_{−∞}^{+∞} x f(x) dx = ∫_0^1 x(4x³) dx = 4 ∫_0^1 x⁴ dx = 4 [x⁵/5]_0^1 = 4/5.
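As a numerical cross-check of Ex2 (not part of the notes), the following simple Riemann-sum sketch recovers E(X) ≈ 0.8 for f(x) = 4x³ on (0, 1).

```python
# Numerical check of Ex2: E(X) = integral of x * f(x) dx should be 4/5 = 0.8.
def pdf(x: float) -> float:
    return 4 * x**3 if 0 < x < 1 else 0.0

steps = 100_000
h = 1.0 / steps
# midpoint rule for the integral of x * f(x) over (0, 1)
mean = sum((i + 0.5) * h * pdf((i + 0.5) * h) for i in range(steps)) * h
print("numerical E(X) ≈", round(mean, 5), "  exact value: 0.8")
```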
Moments of a Distribution
The mean of a distribution is the expected value of the random variable X. If the following integral exists,
   µ′_m = E(X^m) = ∫_{−∞}^{+∞} x^m dF,    (4.1)
it is called the mth moment around the origin and is denoted by µ′_m. Moments can also be obtained around the mean; these are the central moments (denoted by µ_m):
   µ_m = E[(X − µ)^m] = ∫_{−∞}^{+∞} (x − µ)^m dF.    (4.2)
Variance and Standard Deviation
The central moment of a distribution that corresponds to m = 2 is called the variance of the distribution and is denoted by σ² or Var(X). The positive square root of the variance is called the standard deviation and is denoted by σ or Std(X). The variance is the average of the squared deviations from the mean. There are many deviations from the mean, but only one standard deviation. The variance shows the dispersion of a distribution, and by squaring the deviations one treats positive and negative deviations symmetrically.
Mean and Variance of a Normal Distribution
If a random variable X is normally distributed as N(µ, σ²), its mean is µ and its variance is σ². The operation of subtracting the mean and dividing by the standard deviation is called standardizing. The standardized variable Z = (X − µ)/σ is then standard normal, N(0, 1).
Mean and Variance of a Binomial Distribution
A random variable X that is binomially distributed B(n, p) has mean np and variance np(1 − p). (Show this!)
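The sketch below is not the requested proof, only a Monte Carlo check of the claim; n, p and the number of replications are arbitrary illustrative choices.

```python
# Monte Carlo check that a B(n, p) variable has mean np and variance np(1 - p).
import random

random.seed(42)
n, p, reps = 20, 0.3, 100_000
draws = [sum(1 for _ in range(n) if random.random() < p) for _ in range(reps)]

mean = sum(draws) / reps
var = sum((d - mean) ** 2 for d in draws) / reps
print(f"empirical mean {mean:.3f}   vs  np      = {n * p:.3f}")
print(f"empirical var  {var:.3f}   vs  np(1-p) = {n * p * (1 - p):.3f}")
```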
Theorem 24 If E(X)=µ and Var(X)=σ 2 , and a and b are constants, then V ar(a + bX) =
b2 σ 2 . (Show this!)
Example:
Ex3: Let X have the probability density function
   f(x) = 4x³ for 0 < x < 1, and 0 elsewhere.
E(X) = 4/5.
Var(X) = E(X²) − [E(X)]² = ∫_0^1 x²(4x³) dx − (4/5)² = 4 [x⁶/6]_0^1 − (4/5)² = 4/6 − 16/25 = 2/75 ≈ 0.0267.
Expectations and Probabilities
Any probability can be interpreted as an expectation. Define the variable Z which is equal
to 1 if event A occurs, and equal to zero if event A does not occur. Then it is easy to see
that P r(A) = E(Z).
How much information about the probability distribution of a random variable X is
provided by the expectation and variance of X? There are three useful theorems here.
Theorem 25 Markov's Inequality: If X is a nonnegative random variable, that is, if Pr(X < 0) = 0, and k is any positive constant, then Pr(X ≥ k) ≤ E(X)/k.
Theorem 26 Chebyshev's Inequality: Let b be a positive constant and h(X) be a nonnegative measurable function of the random variable X. Then
   Pr(h(X) ≥ b) ≤ (1/b) E[h(X)].
For any constant c > 0 and σ² = Var(X):
Corollary 27 Pr(|X − µ| ≥ c) ≤ σ²/c²
Corollary 28 Pr(|X − µ| ≤ c) ≥ 1 − σ²/c²
Corollary 29 Pr(|X − µ| ≥ kσ) ≤ 1/k²
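The following sketch compares the bound in Corollary 29 with the actual tail probability of one particular distribution (an exponential, chosen only as an example of mine); it illustrates that Chebyshev's bound always holds but can be far from tight.

```python
# Empirical tail probabilities Pr(|X - mu| >= k*sigma) versus the bound 1/k^2,
# for X exponential with rate 1 (so mu = sigma = 1).
import random

random.seed(0)
reps, rate = 100_000, 1.0
mu = sigma = 1.0 / rate            # for an exponential, mean = std = 1/rate

for k in (1.5, 2.0, 3.0):
    tail = sum(1 for _ in range(reps)
               if abs(random.expovariate(rate) - mu) >= k * sigma) / reps
    print(f"k = {k}: empirical tail = {tail:.4f}   Chebyshev bound = {1 / k**2:.4f}")
```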
For linear functions the expectation of the function is the function of the expectation.
But if Y = h(X) is nonlinear, then in general E(Y) ≠ h[E(X)]. The direction of the
inequality may depend on the distribution of X. For certain functions, we can be more
definite.
Theorem 30 Jensen’s Inequality If Y = h(X) is concave and E(X) = µ, then E(Y ) ≤
h(µ).
For example, the logarithmic function is concave, so E[log(X)] ≤ log[E(X)] regardless
of the distribution of X. Similarly, if Y = h(X) is convex, so that it lies everywhere
above its tangent line, then E(Y ) ≥ h(µ). For example, the square function is convex, so
E(X 2 ) ≥ [E(X)]2 regardless of the distribution of X.
Approximate Mean and Variance of g(X)
Suppose X is a random variable defined on (S, z, P(·)) with E(X) = µ and Var(X) = σ², and let g(X) be a differentiable and measurable function of X. We first take a linear approximation of g(X) in the neighborhood of µ. This is given by
   g(X) ≈ g(µ) + g′(µ)(X − µ)    (4.3)
provided g(µ) and g′(µ) exist. Since the second term has zero expectation, E[g(X)] ≈ g(µ), and the variance is Var[g(X)] ≈ σ²[g′(µ)]².
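A small illustration of these approximations (my own example): for g(X) = e^X with X normal and a small σ, the approximate mean and variance are compared with Monte Carlo values.

```python
# Compare E[g(X)] ≈ g(mu) and Var[g(X)] ≈ sigma^2 [g'(mu)]^2 with simulation,
# for g(x) = exp(x), so that g'(x) = exp(x) as well.
import math
import random

random.seed(7)
mu, sigma, reps = 1.0, 0.2, 100_000

approx_mean = math.exp(mu)
approx_var = sigma**2 * math.exp(mu) ** 2

draws = [math.exp(random.gauss(mu, sigma)) for _ in range(reps)]
mc_mean = sum(draws) / reps
mc_var = sum((d - mc_mean) ** 2 for d in draws) / reps

print(f"mean: approximation {approx_mean:.4f}   Monte Carlo {mc_mean:.4f}")
print(f"var : approximation {approx_var:.4f}   Monte Carlo {mc_var:.4f}")
```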
Mode of a Distribution
The point(s) at which f(x) attains its maximum is (are) called the mode. It is the most frequently observed value of X.
Median, Upper and Lower Quartiles, and Percentiles
A value of x such that P(X < x) ≤ 1/2 and P(X ≤ x) ≥ 1/2 is called a median of the distribution. If the point is unique, then it is the median. Thus the median is the point on either side of which lies 50 percent of the distribution. We often prefer the median as an "average" measure because the arithmetic average can be misleading if extreme values are present.
The point(s) with an area 1/4 to the left is (are) called the lower quartile(s), and the
point(s) corresponding to 3/4 is (are) called upper quartile(s).
For any probability p, the values of X for which the area to the right is p are called the upper pth percentiles (also referred to as quantiles).
Coefficient of Variation
The coefficient of variation is defined as the ratio (σ/µ)100, where the numerator is the standard deviation and the denominator is the mean. It measures the dispersion of a distribution relative to its mean and is useful in the estimation of relationships. We usually say that the variable X does not vary much if the coefficient of variation is less than 5 percent. It is also helpful for comparing two variables that are measured on different scales.
Skewness and Kurtosis
If a continuous density f(x) has the property that f(µ + a) = f(µ − a) for all a (µ being the mean of the distribution), then f(x) is said to be symmetric around the mean. If a distribution is not symmetric about the mean, then it is called skewed. A commonly used measure of skewness is α3 = E[(X − µ)³/σ³]. For a symmetric distribution such as the normal, this is zero (α3 = 0). [A positively skewed distribution (α3 > 0) is skewed to the right with a long right tail; a negatively skewed distribution (α3 < 0) is skewed to the left with a long left tail.]
The peakedness of a distribution is called kurtosis. One measure of kurtosis is α4 = E[(X − µ)⁴/σ⁴]. A normal distribution is called mesokurtic (α4 = 3). A narrow distribution is called leptokurtic (α4 > 3) and a flat distribution is called platykurtic (α4 < 3). The value E[(X − µ)⁴/σ⁴] − 3 is often referred to as excess kurtosis.
4.1.2 Moments
Mathematical Expectation
The concept of mathematical expectation is easily extended to bivariate random variables.
We have
   E[g(X, Y)] = ∫∫ g(x, y) dF(x, y)    (4.4)
where the integral is over the (X, Y) space.
Moments
The rth moment of X is
   E(X^r) = ∫ x^r dF(x)    (4.5)
Joint Moments
   E(X^r Y^s) = ∫∫ x^r y^s dF(x, y)
Let X and Y be independent random variables and let u(X) be a function of X only and v(Y) be a function of Y only. Then,
   E[u(X) v(Y)] = E[u(X)] E[v(Y)]    (4.6)
Covariance
Covariance between X and Y is defined as
   σ_XY = Cov(X, Y) = E[(X − µx)(Y − µy)] = E(XY) − µx µy    (4.7)
In the continuous case this takes the form
   σ_XY = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (x − µx)(y − µy) f(x, y) dx dy    (4.8)
and in the discrete case it is
   σ_XY = Σ_x Σ_y (x − µx)(y − µy) f(x, y)    (4.9)
Although the covariance measure is useful in identifying the nature of the association
between X and Y , it has a serious problem, namely, the numerical value is very sensitive
to the units of measurement. To avoid this problem, a ”normalized” covariance measure is
used. This measure is called the correlation coefficient.
Correlation
The quantity
   ρ_XY = σ_XY / (σ_X σ_Y) = Cov(X, Y) / √(Var(X) Var(Y))    (4.10)
is called the correlation coefficient between X and Y. If Cov(X, Y) = 0, then Cor(X, Y) = 0, in which case X and Y are said to be uncorrelated. If two random variables are independent, then σ_XY = 0 and ρ_XY = 0. The converse need not be true.
Theorem 31 |ρ_XY| ≤ 1, that is, −1 ≤ ρ_XY ≤ 1.
The inequality [Cov(X, Y)]² ≤ Var(X) Var(Y) is called the Cauchy-Schwarz Inequality; equivalently, ρ²_XY ≤ 1, that is, −1 ≤ ρ_XY ≤ 1. It should be emphasized that ρ_XY measures only a linear relationship between X and Y. It is possible to have an exact relation but a correlation less than 1, or even 0.
Example:
To illustrate, consider a random variable X which is distributed as Uniform[−θ, θ] and the transformation Y = X². Cov(X, Y) = E(X³) − E(X)E(X²) = 0, because the distribution is symmetric around the origin and hence all the odd moments about the origin are zero. It follows that X and Y are uncorrelated even though there is an exact relation between them. In fact, this result holds for any distribution that is symmetric around the origin.
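A numerical companion to this example (illustrative only): with θ = 1, the sample covariance between X and Y = X² is close to zero even though Y is an exact function of X.

```python
# X ~ Uniform[-theta, theta] and Y = X^2 are exactly related but uncorrelated.
import random

random.seed(3)
theta, reps = 1.0, 100_000
xs = [random.uniform(-theta, theta) for _ in range(reps)]
ys = [x**2 for x in xs]

mx = sum(xs) / reps
my = sum(ys) / reps
cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / reps
print("sample Cov(X, X^2) ≈", round(cov, 5), "  (theoretical value: 0)")
```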
Definition 32 Conditional Expectation: Let X and Y be continuous random variables and g(Y) be a continuous function. Then the conditional expectation (or conditional mean) of g(Y) given X = x, denoted by E_{Y|X}[g(Y) | X], is given by ∫_{−∞}^{∞} g(y) f(y | x) dy, where f(y | x) is the conditional density of Y given X.
Note that E[g(Y ) | X = x] is a function of x and is not a random variable because x is
fixed. The special case of E(Y | X) is called the regression of Y on X.
Theorem 33 Law of Iterated Expectation: EXY [g(Y )] = EX [EY |X {g(Y ) | X}]. That
is, the unconditional expectation is the expectation of the conditional expectation.
Definition 34 Conditional Variance: Let µ_{Y|X} = E(Y | X) = µ*(X) be the conditional mean of Y given X. Then the conditional variance of Y given X is defined as Var(Y | X) = E_{Y|X}[(Y − µ*)² | X]. This is a function of X.
Theorem 35 Var(Y) = E_X[Var(Y | X)] + Var_X[E(Y | X)], that is, the variance of Y is the mean of its conditional variance plus the variance of its conditional mean.
Theorem 36 V ar(aX + bY ) = a2 V ar(X) + 2abCov(X, Y ) + b2 V ar(Y ).
Approximate Mean and Variance for g(X, Y )
After obtaining a linear approximation of the function g(X, Y),
   g(X, Y) ≈ g(µ_X, µ_Y) + (∂g/∂X)(X − µ_X) + (∂g/∂Y)(Y − µ_Y),    (4.11)
its mean can be written E[g(X, Y)] ≈ g(µ_X, µ_Y).
Its variance is
   Var[g(X, Y)] ≈ σ²_X (∂g/∂X)² + σ²_Y (∂g/∂Y)² + 2ρ σ_X σ_Y (∂g/∂X)(∂g/∂Y),    (4.12)
with the partial derivatives evaluated at (µ_X, µ_Y).
Note that approximations may be grossly in error. You should be especially careful with
the variance and covariance approximations.
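As an illustration of (4.11)-(4.12) (my own example, with arbitrary parameter values), the sketch below applies the approximation to g(X, Y) = XY for a bivariate normal pair and compares it with Monte Carlo values.

```python
# Delta-method mean and variance for g(X, Y) = X*Y versus Monte Carlo,
# with (X, Y) bivariate normal; all parameter values are illustrative.
import random

random.seed(11)
mu_x, mu_y, s_x, s_y, rho, reps = 5.0, 3.0, 0.5, 0.4, 0.6, 100_000

# partial derivatives of g(x, y) = x*y evaluated at the means
dg_dx, dg_dy = mu_y, mu_x
approx_mean = mu_x * mu_y
approx_var = (s_x * dg_dx) ** 2 + (s_y * dg_dy) ** 2 \
    + 2 * rho * s_x * s_y * dg_dx * dg_dy

vals = []
for _ in range(reps):
    x = random.gauss(mu_x, s_x)
    # draw y from its conditional distribution given x (bivariate normal)
    cond_mean = mu_y + rho * s_y / s_x * (x - mu_x)
    cond_sd = s_y * (1 - rho**2) ** 0.5
    y = random.gauss(cond_mean, cond_sd)
    vals.append(x * y)

mc_mean = sum(vals) / reps
mc_var = sum((v - mc_mean) ** 2 for v in vals) / reps
print(f"mean: approximation {approx_mean:.3f}   Monte Carlo {mc_mean:.3f}")
print(f"var : approximation {approx_var:.3f}   Monte Carlo {mc_var:.3f}")
```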
Problems
1. For certain ore samples the proportion Y of impurities per sample is a random variable with density function
   f(y) = (3/2)y² + y for 0 ≤ y ≤ 1, and 0 elsewhere.
The dollar value of each sample is W = 5 − 0.5Y. Find the mean and variance of W.
2. The random variable Y has the following probability density function
   f(y) = (3/8)(7 − y)² for 5 ≤ y ≤ 7, and 0 elsewhere.
(a) Find E(Y) and Var(Y).
(b) Find an interval shorter than (5, 7) in which at least 3/4 of the Y values must lie.
(c) Would you expect to see a measurement below 5.5 very often? Why?