Stat 504, Lecture 2
Review of Discrete Probability (contd.)

Overview of probability and inference

[Diagram: probability runs from the data generating process to the observed data; inference runs from the observed data back to the process.]

The basic problem we study in probability: given a data generating process, what are the properties of the outcomes?

The basic problem of statistical inference: given the outcomes, what can we say about the process that generated the data? (ref: Wasserman (2004))
Bernoulli distribution

The most basic of all discrete random variables is the Bernoulli. X is said to have a Bernoulli distribution if X = 1 occurs with probability p and X = 0 occurs with probability 1 − p,

    f(x) = p        if x = 1,
           1 − p    if x = 0,
           0        otherwise.

Another common way to write it is

    f(x) = p^x (1 − p)^{1−x}    for x = 0, 1.

Suppose an experiment has only two possible outcomes, "success" and "failure," and let p be the probability of a success. If we let X denote the number of successes (either zero or one), then X will be Bernoulli. The mean of a Bernoulli is

    E(X) = 1(p) + 0(1 − p) = p,

and the variance of a Bernoulli is

    V(X) = E(X^2) − (E(X))^2 = 1^2 p + 0^2 (1 − p) − p^2 = p(1 − p).
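A small Python sketch, assuming NumPy is available, illustrates these two facts by simulation; the value p = 0.3 is an arbitrary choice for illustration.

    import numpy as np

    # Check E(X) = p and V(X) = p(1 - p) for a Bernoulli(p) variable by simulation.
    rng = np.random.default_rng(0)
    p = 0.3
    x = rng.binomial(n=1, p=p, size=100_000)   # Bernoulli draws are Bin(1, p) draws

    print(x.mean(), p)                # sample mean vs. theoretical mean p
    print(x.var(), p * (1 - p))       # sample variance vs. p(1 - p)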
Binomial distribution

Suppose that X1, X2, ..., Xn are independent and identically distributed (iid) Bernoulli random variables, each having the distribution

    f(xi) = p^{xi} (1 − p)^{1−xi}    for xi = 0, 1.

Let X = Σ_{i=1}^n Xi. Then X is said to have a binomial distribution with parameters n and p, X ∼ Bin(n, p),

    f(x) = [n! / (x! (n − x)!)] p^x (1 − p)^{n−x}    for x = 0, 1, 2, ..., n.

Equivalently, suppose that an experiment consists of n repeated Bernoulli-type trials, each trial resulting in a "success" with probability p and a "failure" with probability 1 − p. If all the trials are independent, that is, if the probability of success on any trial is unaffected by the outcome of any other trial, then the total number of successes in the experiment will have a binomial distribution.

The Bernoulli distribution is a special case of the binomial with n = 1. That is, X ∼ Bin(1, p) means that X has a Bernoulli distribution with success probability p.

One can show algebraically that if X ∼ Bin(n, p) then E(X) = np and V(X) = np(1 − p). An easier way to arrive at these results is to note that X = X1 + X2 + ... + Xn, where X1, X2, ..., Xn are iid Bernoulli random variables. Then, by the additive properties of mean and variance,

    E(X) = E(X1) + E(X2) + ··· + E(Xn) = np

and

    V(X) = V(X1) + V(X2) + ··· + V(Xn) = np(1 − p).
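A brief Python sketch, assuming SciPy is available, ties the pmf and these moment formulas together; n = 10 and p = 0.4 are arbitrary illustrative values.

    import math
    from scipy.stats import binom

    # The Bin(n, p) pmf and its mean np and variance np(1 - p).
    n, p = 10, 0.4

    print(binom.pmf(3, n, p))                           # P(X = 3) from SciPy
    print(math.comb(n, 3) * p**3 * (1 - p)**(n - 3))    # same value from the formula
    print(binom.mean(n, p), n * p)                      # both equal np = 4.0
    print(binom.var(n, p), n * p * (1 - p))             # both equal np(1 - p) = 2.4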
Note that X will not have a binomial distribution if the probability of success p is not constant from trial to trial, or if the trials are not entirely independent (i.e. a success or failure on one trial alters the probability of success on another trial).

If X1 ∼ Bin(n1, p) and X2 ∼ Bin(n2, p), then X1 + X2 ∼ Bin(n1 + n2, p).

As n increases, for fixed p, the binomial distribution approaches the normal distribution N(np, np(1 − p)).
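A short Python sketch, assuming NumPy and SciPy are available, checks the additivity property X1 + X2 ∼ Bin(n1 + n2, p) by convolving two binomial pmfs; n1, n2, and p are arbitrary illustrative values.

    import numpy as np
    from scipy.stats import binom

    # If X1 ~ Bin(n1, p) and X2 ~ Bin(n2, p) are independent, the pmf of X1 + X2
    # (the convolution of the two pmfs) matches the Bin(n1 + n2, p) pmf.
    n1, n2, p = 4, 6, 0.3
    pmf1 = binom.pmf(np.arange(n1 + 1), n1, p)
    pmf2 = binom.pmf(np.arange(n2 + 1), n2, p)

    pmf_sum = np.convolve(pmf1, pmf2)                         # pmf of X1 + X2
    pmf_direct = binom.pmf(np.arange(n1 + n2 + 1), n1 + n2, p)
    print(np.allclose(pmf_sum, pmf_direct))                   # True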
Poisson distribution

The Poisson is a limiting case of the binomial. Suppose that X ∼ Bin(n, p) and let n → ∞ and p → 0 in such a way that np → λ, where λ is a constant. Then, in the limit, X will have a Poisson distribution with parameter λ. The notation X ∼ P(λ) will mean "X has a Poisson distribution with parameter λ." The Poisson probability distribution is

    f(x) = λ^x e^{−λ} / x!    for x = 0, 1, 2, ...

The mean and the variance of the Poisson are both λ; that is, E(X) = V(X) = λ. Note that the parameter λ must always be positive; negative values are not allowed.

Because the Poisson is the limit of the Bin(n, p), it is useful as an approximation to the binomial when n is large and p is small. That is, if n is large and p is small, then

    [n! / (x! (n − x)!)] p^x (1 − p)^{n−x} ≈ λ^x e^{−λ} / x!    (1)

where λ = np. The right-hand side of (1) is typically less tedious and easier to calculate than the left-hand side.
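A small Python sketch, assuming SciPy is available, illustrates the approximation in (1); n = 1000 and p = 0.003 are arbitrary illustrative values, giving λ = 3.

    from scipy.stats import binom, poisson

    # Compare the Bin(n, p) pmf with the Poisson pmf with lambda = np
    # for large n and small p.
    n, p = 1000, 0.003
    lam = n * p

    for x in range(7):
        print(x, binom.pmf(x, n, p), poisson.pmf(x, lam))
    # The two pmf columns agree to within about 0.001 for every x shown.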
Aside from its use as an approximation to the binomial, the Poisson distribution is also an important probability model in its own right. It is often used to model discrete events occurring in time or in space.

For example, suppose that X is the number of telephone calls arriving at a switchboard in one hour. Suppose that in the long run, the average number of telephone calls per hour is λ. Then it may be reasonable to assume X ∼ P(λ). For the Poisson model to hold, however, the average arrival rate λ must be fairly constant over time; that is, there should be no systematic or predictable changes in the arrival rate. Moreover, the arrivals should be independent of one another; that is, the arrival of one call should not make the arrival of another call more or less likely.

Likelihood function

One of the most fundamental concepts of modern statistics is that of likelihood. In each of the discrete random variables we have considered thus far, the distribution depends on one or more parameters that are, in most statistical applications, unknown. In the Poisson distribution, the parameter is λ. In the binomial, the parameter of interest is p (since n is typically fixed and known).

Likelihood is a tool for summarizing the data's evidence about parameters. Let us denote the unknown parameter(s) of a distribution generically by θ. Since the probability distribution depends on θ, we can make this dependence explicit by writing f(x) as f(x ; θ). For example, in the Bernoulli distribution the parameter is θ = p, and the distribution is

    f(x ; p) = p^x (1 − p)^{1−x},    x = 0, 1.    (2)

Once a value of X has been observed, we can plug this observed value x into f(x ; p) and obtain a function of p only. For example, if we observe X = 1, then plugging x = 1 into (2) gives the function p. If we observe X = 0, the function becomes 1 − p.
Whatever function of the parameter results when we plug the observed data x into f(x ; θ) is called the likelihood function. We will write the likelihood function as

    L(θ ; x) = Π_{i=1}^n f(Xi ; θ),

or sometimes just L(θ).

Algebraically, the likelihood L(θ ; x) is just the same as the distribution f(x ; θ), but its meaning is quite different because it is regarded as a function of θ rather than a function of x. Consequently, a graph of the likelihood usually looks very different from a graph of the probability distribution.

For discrete random variables, a graph of the probability distribution f(x ; θ) has spikes at specific values of x, whereas a graph of the likelihood L(θ ; x) is a continuous curve (e.g. a line) over the parameter space, the domain of possible values for θ.

For example, suppose that X has a Bernoulli distribution with unknown parameter p. We can graph the probability distribution for any fixed value of p. For example, if p = .5 we get this:

[Figure: the probability distribution f(x ; .5), with spikes of height .5 at x = 0 and x = 1.]

Now suppose that we observe a value of X, say X = 1. Plugging x = 1 into the distribution p^x (1 − p)^{1−x} gives the likelihood function L(p ; x) = p, which looks like this:

[Figure: the likelihood L(p ; x) = p, a straight line rising from 0 at p = 0 to 1 at p = 1.]

L(θ ; x) summarizes the evidence about θ contained in the event X = x. L(θ ; x) is high for values of θ that make X = x likely, and small for values of θ that make X = x unlikely. In the Bernoulli example, observing X = 1 gives some (albeit weak) evidence that p is nearer to 1 than to 0, so the likelihood for x = 1 rises as p moves from 0 to 1.
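A tiny Python sketch, assuming NumPy is available, evaluates this Bernoulli likelihood on a grid of p values and confirms the direction of the evidence.

    import numpy as np

    # L(p ; x) = p^x (1 - p)^(1 - x) as a function of p, for x = 1 and x = 0.
    p_grid = np.linspace(0, 1, 101)
    L_x1 = p_grid                      # likelihood when X = 1
    L_x0 = 1 - p_grid                  # likelihood when X = 0

    print(np.all(np.diff(L_x1) > 0))   # True: likelihood for x = 1 rises with p
    print(np.all(np.diff(L_x0) < 0))   # True: likelihood for x = 0 falls with p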
Suppose that an experiment consists of n = 5 independent Bernoulli trials, each having probability of success p. Let X be the total number of successes in the trials, so that X ∼ Bin(5, p). If the outcome is X = 3, the likelihood is

    L(p ; x) = [n! / (x! (n − x)!)] p^x (1 − p)^{n−x}
             = [5! / (3! (5 − 3)!)] p^3 (1 − p)^{5−3}
             ∝ p^3 (1 − p)^2,

where the constant at the beginning is ignored. A graph of L(p ; x) = p^3 (1 − p)^2 over the unit interval p ∈ (0, 1) looks like this:

[Figure: the likelihood p^3 (1 − p)^2, a single-peaked curve on (0, 1) with its maximum at p = .6.]
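A short Python sketch, assuming NumPy is available, evaluates this likelihood on a grid and locates its peak.

    import numpy as np

    # L(p ; x) proportional to p^3 (1 - p)^2 for the n = 5, X = 3 example.
    p_grid = np.linspace(0.001, 0.999, 999)
    L = p_grid**3 * (1 - p_grid)**2

    print(p_grid[np.argmax(L)])        # approximately 0.6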
Maximum-likelihood (ML) estimation

It's interesting that this function reaches its maximum value at p = .6. An intelligent person would have said that if we observe 3 successes in 5 trials, a reasonable estimate of the long-run proportion of successes p would be 3/5 = .6.

This example suggests that it may be reasonable to estimate an unknown parameter θ by the value for which the likelihood function L(θ ; x) is largest. This approach is called maximum-likelihood (ML) estimation. We will denote the value of θ that maximizes the likelihood function by θ̂, read "theta hat." θ̂ is called the maximum-likelihood estimate (MLE) of θ.
Finding MLE's usually involves techniques of differential calculus. To maximize L(θ ; x) with respect to θ:

• first calculate the derivative of L(θ ; x) with respect to θ,

• set the derivative equal to zero, and

• solve the resulting equation for θ.

These computations can often be simplified by maximizing the loglikelihood function,

    l(θ ; x) = log L(θ ; x),

where "log" means natural log (logarithm to the base e). Because the natural log is an increasing function, maximizing the loglikelihood is the same as maximizing the likelihood. The loglikelihood often has a much simpler form than the likelihood and is usually easier to differentiate.

In Stat 504 you will not be asked to derive MLE's by yourself. In most of the probability models that we will use later in the course (logistic regression, loglinear models, etc.) no explicit formulas for MLE's are available, and we will have to rely on computer packages to calculate the MLE's for us. For the simple probability models we have seen thus far, however, explicit formulas for MLE's are available and are given next.
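As an illustration of relying on the computer, here is a minimal Python sketch, assuming SciPy is available, that maximizes the loglikelihood numerically for the Bin(5, p) example with x = 3 by minimizing its negative.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # l(p) = 3 log p + 2 log(1 - p), dropping the constant term; the maximizer
    # should be p-hat = 3/5 = 0.6.
    def neg_loglik(p):
        return -(3 * np.log(p) + 2 * np.log(1 - p))

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 1 - 1e-6), method="bounded")
    print(res.x)                       # approximately 0.6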
ML for Bernoulli trials. If our experiment is a single Bernoulli trial and we observe X = 1 (success) then the likelihood function is L(p ; x) = p. This function reaches its maximum at p̂ = 1. If we observe X = 0 (failure) then the likelihood is L(p ; x) = 1 − p, which reaches its maximum at p̂ = 0. Of course, it is somewhat silly for us to try to make formal inferences about θ on the basis of a single Bernoulli trial; usually multiple trials are available.

Suppose that X = (X1, X2, ..., Xn) represents the outcomes of n independent Bernoulli trials, each with success probability p. The likelihood for p based on X is defined as the joint probability distribution of X1, X2, ..., Xn. Since X1, X2, ..., Xn are iid random variables, the joint distribution is

    L(p ; x) = f(x ; p)
             = Π_{i=1}^n f(xi ; p)
             = Π_{i=1}^n p^{xi} (1 − p)^{1−xi}
             = p^{Σ xi} (1 − p)^{n − Σ xi}.

Differentiating the log of L(p ; x) with respect to p and setting the derivative to zero shows that this function achieves a maximum at p̂ = Σ_{i=1}^n xi / n. Since Σ_{i=1}^n xi is the total number of successes observed in the n trials, p̂ is the observed proportion of successes in the n trials. We often call p̂ the sample proportion to distinguish it from p, the "true" or "population" proportion. For repeated Bernoulli trials, the MLE p̂ is the sample proportion of successes.
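A compact Python sketch, assuming NumPy is available, computes p̂ for an arbitrary illustrative sample of n = 10 trials and checks it against a grid search over the likelihood.

    import numpy as np

    # The MLE for iid Bernoulli trials is the sample proportion sum(x_i)/n.
    x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])   # 1 = success, 0 = failure
    n, s = len(x), x.sum()

    p_hat = s / n
    print(p_hat)                       # 0.6, the observed proportion of successes

    # A grid search over the likelihood p^s (1 - p)^(n - s) gives the same answer.
    p_grid = np.linspace(0.001, 0.999, 999)
    L = p_grid**s * (1 - p_grid)**(n - s)
    print(p_grid[np.argmax(L)])        # approximately 0.6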
ML for Binomial. Suppose that X is an observation from a binomial distribution, X ∼ Bin(n, p), where n is known and p is to be estimated. The likelihood function is

    L(p ; x) = [n! / (x! (n − x)!)] p^x (1 − p)^{n−x},

which, except for the factor n!/(x! (n − x)!), is identical to the likelihood from n independent Bernoulli trials with x = Σ_{i=1}^n xi. But since the likelihood function is regarded as a function only of the parameter p, the factor n!/(x! (n − x)!) is a fixed constant and does not affect the MLE. Thus the MLE is again p̂ = x/n, the sample proportion of successes.

The fact that the MLE based on n independent Bernoulli random variables and the MLE based on a single binomial random variable are the same is not surprising, since the binomial is the result of n independent Bernoulli trials anyway. In general, whenever we have repeated, independent Bernoulli trials with the same probability of success p for each trial, the MLE will always be the sample proportion of successes. This is true regardless of whether we know the outcomes of the individual trials X1, X2, ..., Xn, or just the total number of successes for all trials X = Σ_{i=1}^n Xi.
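A brief Python sketch, assuming NumPy is available, confirms that the constant factor n!/(x! (n − x)!) does not move the maximizer; n = 5 and x = 3 are the values from the earlier example.

    import numpy as np
    from math import comb

    # The full binomial likelihood and its kernel have the same argmax, x/n.
    n, x = 5, 3
    p_grid = np.linspace(0.001, 0.999, 999)

    L_full = comb(n, x) * p_grid**x * (1 - p_grid)**(n - x)
    L_kernel = p_grid**x * (1 - p_grid)**(n - x)

    print(p_grid[np.argmax(L_full)], p_grid[np.argmax(L_kernel)])   # both about 0.6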
Suppose now that we have a sample of iid binomial random variables. For example, suppose that X1, X2, ..., X10 are an iid sample from a binomial distribution with n = 5 and p unknown. Since each Xi is actually the total number of successes in 5 independent Bernoulli trials, and since the Xi's are independent of one another, their sum X = Σ_{i=1}^{10} Xi is actually the total number of successes in 50 independent Bernoulli trials. Thus X ∼ Bin(50, p) and the MLE is p̂ = x/n, the observed proportion of successes across all 50 trials. Whenever we have independent binomial random variables with a common p, we can always add them together to get a single binomial random variable.

Adding the binomial random variables together produces no loss of information about p if the model is true. But collapsing the data in this way may limit our ability to diagnose model failure, i.e. to check whether the binomial model is really appropriate.
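A small Python sketch, assuming NumPy is available, pools an arbitrary illustrative sample of ten Bin(5, p) counts and computes the MLE from the 50 underlying trials.

    import numpy as np

    # Ten iid Bin(5, p) counts pooled into one Bin(50, p) count; the MLE is
    # total successes / total trials.
    counts = np.array([2, 3, 1, 4, 2, 3, 2, 1, 3, 2])   # each Xi is out of 5 trials
    total_trials = 5 * len(counts)                      # 50

    p_hat = counts.sum() / total_trials
    print(p_hat)                                        # 0.46 for this sample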
ML for Poisson. Suppose that X = (X1, X2, ..., Xn) are iid observations from a Poisson distribution with unknown parameter λ. The likelihood function is

    L(λ ; x) = Π_{i=1}^n f(xi ; λ)
             = Π_{i=1}^n λ^{xi} e^{−λ} / xi!
             = λ^{Σ xi} e^{−nλ} / (x1! x2! ··· xn!).

By differentiating the log of this function with respect to λ, one can show that the maximum is achieved at λ̂ = Σ_{i=1}^n xi / n. Thus, for a Poisson sample, the MLE for λ is just the sample mean.
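A final Python sketch, assuming NumPy and SciPy are available, checks this on an arbitrary illustrative sample: the numerical maximizer of the loglikelihood matches the sample mean.

    import numpy as np
    from scipy.optimize import minimize_scalar

    # For iid Poisson counts, the MLE of lambda is the sample mean x-bar.
    x = np.array([2, 0, 3, 1, 4, 2, 1, 3, 2, 2])

    def neg_loglik(lam):
        # log of lambda^(sum x) e^(-n*lambda), dropping the factorial constant
        return -(x.sum() * np.log(lam) - len(x) * lam)

    res = minimize_scalar(neg_loglik, bounds=(1e-6, 20.0), method="bounded")
    print(x.mean(), res.x)             # both approximately 2.0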
Next: What happens to the loglikelihood as n gets large.