Discrete Random Variables
COMP 245 STATISTICS
Dr N A Heard
Contents

1 Random Variables
  1.1 Introduction
  1.2 Cumulative Distribution Function

2 Discrete Random Variables
  2.1 Simple Random Variables
  2.2 Discrete Random Variables
  2.3 Probability Mass Function

3 Mean and Variance
  3.1 Expectation
  3.2 Variance, Standard Deviation and Skewness
  3.3 Comments
  3.4 Sums of Random Variables

4 Discrete Distributions
  4.1 Bernoulli, Binomial and Geometric Distributions
  4.2 Poisson Distribution
  4.3 Discrete Uniform Distribution
1 Random Variables

1.1 Introduction
Definition
Suppose, as before, that for a random experiment we have identified a sample space S and
a probability measure P( E) defined on (measurable) subsets E ⊆ S.
A random variable is a mapping from the sample space to the real numbers. So if X is a
random variable, X : S → R. Each element of the sample space s ∈ S is assigned by X a (not
necessarily unique) numerical value X (s).
If we denote the unknown outcome of the random experiment as s∗ , then the corresponding
unknown outcome of the random variable X (s∗ ) will be generically referred to as X.
The probability measure P already defined on S induces a probability distribution on the
random variable X in R:
For each x ∈ R, let Sx ⊆ S be the set containing just those elements of S which are mapped
by X to numbers no greater than x. That is, let Sx be the inverse image of (−∞, x ] under the
function X. Then, noting the equivalence

X(s∗) ≤ x ⇐⇒ s∗ ∈ Sx,

we see that

PX(X ≤ x) ≡ P(Sx).
The image of S under X is called the range of the random variable:
range( X ) ≡ X (S) = { x ∈ R|∃s ∈ S s.t. X (s) = x }
So as S contains all the possible outcomes of the experiment, range(X) contains all the
possible outcomes for the random variable X.
Example
Let our random experiment be tossing a fair coin, with sample space {H, T} and probability measure P({H}) = P({T}) = 1/2.
We can define a random variable X : {H, T} → R taking values, say,
X (T) = 0,
X (H) = 1.
In this case, what does Sx look like for each x ∈ R?

Sx =  ∅         if x < 0;
      {T}       if 0 ≤ x < 1;
      {H, T}    if x ≥ 1.

This defines a range of probabilities PX on the continuum R:

PX(X ≤ x) = P(Sx) =  P(∅) = 0         if x < 0;
                     P({T}) = 1/2     if 0 ≤ x < 1;
                     P({H, T}) = 1    if x ≥ 1.
[Figure: PX(X ≤ x) plotted against x for the coin-toss random variable: a step function rising from 0 to 1/2 at x = 0 and from 1/2 to 1 at x = 1.]

1.2 Cumulative Distribution Function
cdf
The cumulative distribution function (cdf) of a random variable X, written FX(x) (or just F(x)), is the probability that X takes a value less than or equal to x:

FX(x) = PX(X ≤ x).
For any random variable X, FX is right-continuous, meaning if a decreasing sequence of real
numbers x1 , x2 , . . . → x, then FX ( x1 ), FX ( x2 ), . . . → FX ( x ).
cdf Properties
For a given function FX ( x ), to check this is a valid cdf, we need to make sure the following
conditions hold.
1. 0 ≤ FX ( x ) ≤ 1, ∀ x ∈ R;
2. Monotonicity: ∀ x1 , x2 ∈ R, x1 < x2 ⇒ FX ( x1 ) ≤ FX ( x2 );
3. FX (−∞) = 0, FX (∞) = 1.
• For finite intervals ( a, b] ⊆ R, it is easy to check that
PX ( a < X ≤ b) = FX (b) − FX ( a).
• Unless there is any ambiguity, we generally suppress the subscript of PX (·) in our notation and just write P(·) for the probability measure for the random variable.
– That is, we forget about the underlying sample space and just think about the random variable and its probabilities.
– Often, it will be most convenient to work this way and consider the random variable
directly from the very start, with the range of X being our sample space.
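To make these cdf properties concrete, here is a minimal Python sketch (an addition to these notes, not part of the original transcript) encoding FX for the coin-toss example of Section 1.1; the names F and prob_interval are illustrative choices.

```python
# Minimal sketch (assumed Python): the cdf of the coin-toss random variable
# with X(T) = 0, X(H) = 1 and P({H}) = P({T}) = 1/2.

def F(x):
    """F_X(x) = P_X(X <= x) for the coin-toss random variable."""
    if x < 0:
        return 0.0      # S_x is empty
    elif x < 1:
        return 0.5      # S_x = {T}
    else:
        return 1.0      # S_x = {H, T}

def prob_interval(a, b):
    """P_X(a < X <= b) = F_X(b) - F_X(a) for a finite interval (a, b]."""
    return F(b) - F(a)

print(F(-0.5), F(0.3), F(2))   # 0.0 0.5 1.0
print(prob_interval(-1, 0))    # 0.5, i.e. P(X = 0)
print(prob_interval(0, 1))     # 0.5, i.e. P(X = 1)
```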
2 Discrete Random Variables

2.1 Simple Random Variables
We say a random variable is simple if it can take only a finite number of possible values. That
is,
X is simple ⇐⇒ range( X ) is finite.
Suppose X is simple, and can take one of m values X = { x1 , x2 , . . . , xm } ordered so that
x1 < x2 < . . . < xm . Each sample space element s ∈ S is mapped by X to one of these m values.
Then in this case, we can partition the sample space S into m disjoint subsets { E1 , E2 , . . . , Em }
so that s ∈ Ei ⇐⇒ X (s) = xi , i = 1, 2, . . . , m.
We can then write down the probability of the random variable X taking the particular
value xi as
PX ( X = xi ) ≡ P( Ei ).
It is also easy to check that
PX ( X = xi ) = FX ( xi ) − FX ( xi−1 ),
where we can take x0 = −∞.
Examples of Simple Random Variables
Consider once again the experiment of rolling a single die.
Then S = {⚀, ⚁, ⚂, ⚃, ⚄, ⚅} and for any face s ∈ S we have P({s}) = 1/6.

• An obvious random variable we could define on S would be X : S → R, s.t.

X(⚀) = 1, X(⚁) = 2, . . . , X(⚅) = 6.

Then e.g. PX(1 < X ≤ 5) = P({⚁, ⚂, ⚃, ⚄}) = 4/6 = 2/3 and PX(X ∈ {2, 4, 6}) = P({⚁, ⚃, ⚅}) = 1/2.

• Alternatively, we could define a random variable Y : S → R, s.t.

Y(⚀) = Y(⚂) = Y(⚄) = 0,    Y(⚁) = Y(⚃) = Y(⚅) = 1.

Then clearly PY(Y = 0) = P({⚀, ⚂, ⚄}) = 1/2 and PY(Y = 1) = P({⚁, ⚃, ⚅}) = 1/2.
Comments
• Note that under either random variable X or Y, we still got the same probability of getting
an even number of spots on the die.
Indeed, this would be the case for any random variable we may care to define.
A random variable is simply a numeric relabelling of our underlying sample space, and
all probabilities are derived from the associated underlying probability measure.
2.2 Discrete Random Variables
A simple random variable is a special case of a discrete random variable. We say a random
variable is discrete if it can take only a countable number of possible values. That is,
X is discrete ⇐⇒ range( X ) is countable.
Suppose X is discrete, and can take one of the countable set of values X = { x1 , x2 , . . .}
ordered so that x1 < x2 < . . .. Each sample space element s ∈ S is mapped by X to one of these
values.
Then in this case, we can partition the sample space S into a countable collection of disjoint
subsets { E1 , E2 , . . .} s.t. s ∈ Ei ⇐⇒ X (s) = xi , i = 1, 2, . . ..
As with simple random variables we can write down the probability of the discrete random
variable X taking the particular value xi as
PX ( X = xi ) ≡ P( Ei ).
We again have
PX(X = xi) = FX(xi) − FX(xi−1).
For a discrete random variable X, FX is a monotonic increasing step function with jumps
only at points in X .
[Figure: Example Poisson(5) cdf, F(x) plotted against x for x = 0 to 20: a monotonic increasing step function with jumps at the non-negative integers.]
2.3 Probability Mass Function
For a discrete random variable X and x ∈ R, we define the probability mass function (pmf),
p X ( x ) (or just p( x )) as
p X ( x ) = PX ( X = x )
pmf Properties
To check we have a valid pmf, we need to make sure the following conditions hold.
If X can take values X = { x1 , x2 , . . .} then we must have:
1. 0 ≤ p X ( x ) ≤ 1, ∀ x ∈ R;
2. ∑x∈X pX(x) = 1.
[Figure: Example Poisson(5) pmf, p(x) plotted against x for x = 0 to 20.]
Knowing either the pmf or cdf of a discrete random variable characterises its probability
distribution.
That is, from the pmf we can derive the cdf, and vice versa.
p(xi) = F(xi) − F(xi−1),        F(xi) = ∑_{j=1}^{i} p(xj).
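As a quick illustration of this correspondence, the following Python sketch (an addition to the notes) builds the cdf from a pmf by cumulative summation and recovers the pmf by differencing, using the Poisson(5) pmf truncated at x = 20 purely for illustration.

```python
# Sketch (assumed Python, standard library only) of the pmf <-> cdf
# correspondence for a discrete random variable on {0, 1, ..., 20}.
import math

lam = 5.0
xs = list(range(21))
pmf = [math.exp(-lam) * lam**x / math.factorial(x) for x in xs]  # Poisson(5) pmf

# cdf by cumulative summation: F(x_i) = sum_{j <= i} p(x_j)
cdf = []
total = 0.0
for p in pmf:
    total += p
    cdf.append(total)

# pmf recovered by differencing: p(x_i) = F(x_i) - F(x_{i-1})
pmf_back = [cdf[0]] + [cdf[i] - cdf[i - 1] for i in range(1, len(cdf))]

assert all(abs(a - b) < 1e-12 for a, b in zip(pmf, pmf_back))
print(round(cdf[-1], 6))   # very close to 1; the tail beyond x = 20 is negligible
```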
Links with Statistics
We can now see the first links between the numerical summaries and graphical displays
we saw in earlier lectures and probability theory:
We can often think of a set of data ( x1 , x2 , . . . , xn ) as n realisations of a random variable X
defined on an underlying population for the data.
• Recall the normalised frequency counts we considered for a set of data, known as the
empirical probability mass function. This can be seen as an empirical estimate for the pmf
of their underlying population.
• Also recall the empirical cumulative distribution function. This too is an empirical estimate, but for the cdf of the underlying population.
3 Mean and Variance

3.1 Expectation
E( X )
For a discrete random variable X we define the expectation of X,
EX(X) = ∑x x pX(x).
EX ( X ) (often just written E( X ) or even µ X ) is also referred to as the mean of X.
It gives a weighted average of the possible values of the random variable X, with the
weights given by the probabilities of each outcome.
Examples
1. If X is a r.v. taking the integer value scored on a single roll of a fair die, then
E(X) = ∑_{x=1}^{6} x p(x) = 1·(1/6) + 2·(1/6) + 3·(1/6) + 4·(1/6) + 5·(1/6) + 6·(1/6) = 21/6 = 3.5.
2. If now X is a score from a student answering a single multiple choice question with four
options, with 3 marks awarded for a correct answer, -1 for a wrong answer and 0 for no
answer, what is the expected value if they answer at random?
E(X) = 3 · P(Correct) + (−1) · P(Incorrect) = 3 · (1/4) − 1 · (3/4) = 0.
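Both answers can be reproduced with a short Python sketch (an addition, not from the notes) that applies E(X) = ∑x x p(x) directly; representing the pmf as a dictionary is just one convenient choice.

```python
# Sketch (assumed Python): expectation as a probability-weighted average.

def expectation(pmf):
    """E(X) = sum_x x * p(x); pmf is a dict mapping value -> probability."""
    return sum(x * p for x, p in pmf.items())

die = {x: 1 / 6 for x in range(1, 7)}   # fair die
mcq = {3: 1 / 4, -1: 3 / 4, 0: 0.0}     # random guess on a 4-option question

print(expectation(die))   # 3.5
print(expectation(mcq))   # 0.0
```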
E{ g( X )}
More generally, for a function of interest g : R → R of the random variable X, first notice
that the composition g( X ),
g( X )(s) = ( g ◦ X )(s)
is also a random variable. It follows that
EX{g(X)} = ∑x g(x) pX(x).        (1)
Linearity of Expectation
Consider the linear function g( X ) = aX + b for constants a, b ∈ R. We can see from (1) that
EX(aX + b) = ∑x (ax + b) pX(x) = a ∑x x pX(x) + b ∑x pX(x),

and since ∑x x pX(x) = E(X) and ∑x pX(x) = 1 we have

E(aX + b) = aE(X) + b,    ∀ a, b ∈ R.
It is equally easy to check that for g, h : R → R, we have
E{ g( X ) + h( X )} = E{ g( X )} + E{ h( X )}.
3.2 Variance, Standard Deviation and Skewness
Var( X )
Consider another special case of g( X ), namely
g( X ) = { X − E( X )}2 .
The expectation of this function wrt PX gives a measure of dispersion or variability of the
random variable X around its mean, called the variance and denoted VarX ( X ) (or sometimes
σX2 ):
VarX ( X ) = EX [{ X − EX ( X )}2 ].
We can expand the expression { X − E( X )}2 and exploit the linearity of expectation to get
an alternative formula for the variance.
{X − E(X)}² = X² − 2E(X)X + {E(X)}²

⇒ Var(X) = E[X² − {2E(X)}X + {E(X)}²] = E(X²) − 2E(X)E(X) + {E(X)}²,

and hence

Var(X) = E(X²) − {E(X)}².
Variance of a Linear Function of a Random Variable
We saw earlier that for constants a, b ∈ R the linear combination aX + b had expectation
aE( X ) + b. What about the variance?
It is easy to show that the corresponding result is
Var(aX + b) = a² Var(X),    ∀ a, b ∈ R.
sd( X )
The standard deviation of a random variable X, written sdX ( X ) (or sometimes σX ), is the
square root of the variance.
sdX(X) = √VarX(X).
Skewness
The skewness (γ1 ) of a discrete random variable X is given by
γ1 = EX[{X − EX(X)}³] / sdX(X)³.
That is, if µ = E( X ) and σ = sd( X ),
γ1 = E[(X − µ)³] / σ³.
Examples
1. If X is a r.v. taking the integer value scored with a single roll of a fair die, then
Var(X) = ∑_{x=1}^{6} x² p(x) − 3.5² = 1²·(1/6) + 2²·(1/6) + . . . + 6²·(1/6) − 3.5² = 91/6 − 12.25 ≈ 2.92.
2. If now X is a score from a student answering a single multiple choice question with four
options, with 3 marks awarded for a correct answer, -1 for a wrong answer and 0 for no
answer, what is the standard deviation if they answer at random?
E(X²) = 3² · P(Correct) + (−1)² · P(Incorrect) = 9 · (1/4) + 1 · (3/4) = 3

⇒ sd(X) = √(3 − 0²) = √3.
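Both calculations can be checked numerically; the sketch below (a Python addition, with an illustrative expectation helper) applies Var(X) = E(X²) − {E(X)}² and sd(X) = √Var(X).

```python
# Sketch (assumed Python): variance and standard deviation of a discrete r.v.
import math

def expectation(pmf, g=lambda x: x):
    """E{g(X)} = sum_x g(x) * p(x); pmf is a dict mapping value -> probability."""
    return sum(g(x) * p for x, p in pmf.items())

def variance(pmf):
    mu = expectation(pmf)
    return expectation(pmf, lambda x: x ** 2) - mu ** 2

die = {x: 1 / 6 for x in range(1, 7)}   # fair die
mcq = {3: 1 / 4, -1: 3 / 4}             # random guess on a 4-option question

print(variance(die))              # 35/12 ≈ 2.9167
print(math.sqrt(variance(mcq)))   # √3 ≈ 1.7321
```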
3.3 Comments
Links with Statistics
We have met three important quantities for a random variable, defined through expectation
- the mean µ, the variance σ2 and the standard deviation σ.
Again we can see a duality with the corresponding numerical summaries for data which
we met - the sample mean x̄, the sample variance s2 and the sample standard deviation s.
The duality is this: If we were to consider the data sample as the population and draw a
random member from that sample as a random variable, this r.v. would have cdf Fn ( x ), the
empirical cdf. The mean of this r.v. is µ = x̄, its variance is σ² = s², and its standard deviation is σ = s.
3.4 Sums of Random Variables
Expectation of Sums of Random Variables
Let X1 , X2 , . . . , Xn be n random variables, perhaps with different distributions and not necessarily independent.
Let Sn = ∑_{i=1}^{n} Xi be the sum of those variables, and Sn/n be their average.

Then the mean of Sn is given by

E(Sn) = ∑_{i=1}^{n} E(Xi),        E(Sn/n) = ∑_{i=1}^{n} E(Xi) / n.
Variance of Sums of Random Variables
However, for the variance of Sn , only if X1 , X2 , . . . , Xn are independent do we have
Var(Sn) = ∑_{i=1}^{n} Var(Xi),        Var(Sn/n) = ∑_{i=1}^{n} Var(Xi) / n².
So if X1 , X2 , . . . , Xn are independent and identically distributed with E( Xi ) = µ X and
Var( Xi ) = σX2 we get
E(Sn/n) = µX,        Var(Sn/n) = σX²/n.
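A simple Monte Carlo check of these two results, assuming Python with NumPy (an addition to the notes): for i.i.d. fair-die rolls, µX = 3.5 and σX² = 35/12, so the averages of n = 30 rolls should have mean close to 3.5 and variance close to (35/12)/30 ≈ 0.097.

```python
# Monte Carlo sketch (assumes NumPy is available): E(S_n/n) = mu_X and
# Var(S_n/n) = sigma_X^2 / n for i.i.d. fair-die rolls.
import numpy as np

rng = np.random.default_rng(0)
n, reps = 30, 200_000

rolls = rng.integers(1, 7, size=(reps, n))   # reps independent samples of n rolls
averages = rolls.mean(axis=1)                # S_n / n for each sample

print(averages.mean())   # ≈ 3.5
print(averages.var())    # ≈ (35/12) / 30 ≈ 0.097
```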
4 Discrete Distributions

4.1 Bernoulli, Binomial and Geometric Distributions
Bernoulli( p)
Consider an experiment with only two possible outcomes, encoded as a random variable
X taking value 1, with probability p, or 0, with probability (1 − p), accordingly.
(Ex.: Tossing a coin, X = 1 for a head, X = 0 for tails, p = 1/2.)
Then we say X ∼ Bernoulli( p) and note the pmf to be
p(x) = p^x (1 − p)^(1−x),    x = 0, 1.
Using the formulae for mean and variance, it follows that
µ = p,    σ² = p(1 − p).
[Figure: Example Bernoulli(¼) pmf.]
Binomial(n, p)
Consider n identical, independent Bernoulli( p) trials X1 , . . . , Xn .
Let X = ∑_{i=1}^{n} Xi be the total number of 1s observed in the n trials.

(Ex.: Tossing a coin n times, X is the number of heads obtained, p = 1/2.)

Then X is a random variable taking values in {0, 1, 2, . . . , n}, and we say X ∼ Binomial(n, p). From the Binomial Theorem we find the pmf to be

p(x) = (n choose x) p^x (1 − p)^(n−x),    x = 0, 1, 2, . . . , n.
To calculate the Binomial pmf we recall that (n choose x) = n! / (x!(n − x)!) and x! = ∏_{i=1}^{x} i. (Note 0! = 1.)
It can be shown, either directly from the pmf or from the results for sums of random variables, that the mean and variance are
µ = np,    σ² = np(1 − p).

Similarly, the skewness is given by

γ1 = (1 − 2p) / √(np(1 − p)).
[Figure: Example Binomial(20,¼) pmf.]
Example
Suppose that 10 users are authorised to use a particular computer system, and that the
system collapses if 7 or more users attempt to log on simultaneously. Suppose that each user
has the same probability p = 0.2 of wishing to log on in each hour.
What is the probability that the system will crash in a given hour?
Solution: The probability that exactly x users will want to log on in any hour is given by
Binomial(n, p) = Binomial(10, 0.2).
Hence the probability of 7 or more users wishing to log on in any hour is
p(7) + p(8) + p(9) + p(10) = (10 choose 7) 0.2⁷ 0.8³ + . . . + (10 choose 10) 0.2¹⁰ 0.8⁰ = 0.00086.
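This number is easy to reproduce; the short Python sketch below (an addition, standard library only) sums the Binomial(10, 0.2) pmf from 7 to 10.

```python
# Sketch (assumed Python): P(X >= 7) for X ~ Binomial(10, 0.2), the probability
# that the system crashes in a given hour.
from math import comb

def binom_pmf(x, n, p):
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

n, p = 10, 0.2
crash_prob = sum(binom_pmf(x, n, p) for x in range(7, n + 1))
print(round(crash_prob, 5))   # ≈ 0.00086
```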
Two more examples
• A manufacturing plant produces chips with a defect rate of 10%. The quality control
procedure consists of checking samples of size 50. Then the distribution of the number
of defectives is expected to be Binomial(50,0.1).
• When transmitting binary digits through a communication channel, the number of digits received correctly out of n transmitted digits can be modelled by a Binomial(n, p), where p is the probability that a digit is transmitted correctly.
Note: Recall the independence condition necessary for these models to be reasonable.
Geometric( p)
Consider a potentially infinite sequence of independent Bernoulli( p) random variables
X1 , X2 , . . ..
Suppose we define a quantity X by
X = min{i | Xi = 1}
to be the index of the first Bernoulli trial to result in a 1.

(Ex.: Tossing a coin, X is the number of tosses until the first head is obtained, p = 1/2.)

Then X is a random variable taking values in Z+ = {1, 2, . . .}, and we say X ∼ Geometric(p). Clearly the pmf is given by

p(x) = p(1 − p)^(x−1),    x = 1, 2, . . ..
The mean and variance are

µ = 1/p,    σ² = (1 − p)/p².

The skewness is given by

γ1 = (2 − p) / √(1 − p),

and so is always positive.
[Figure: Example Geometric(¼) pmf.]
Alternative Formulation
If X ∼ Geometric( p), let us consider Y = X − 1.
Then Y is a random variable taking values in N = {0, 1, 2, . . .}, and corresponds to the
number of independent Bernoulli( p) trials before we obtain our first 1. (Some texts refer to this
as the Geometric distribution.)
Note we have pmf

pY(y) = p(1 − p)^y,    y = 0, 1, 2, . . .,

and the mean becomes

µY = (1 − p)/p,

while the variance and skewness are unaffected by the shift.
Example
Suppose people have problems logging onto a particular website once every 5 attempts, on
average.
1. Assuming the attempts are independent, what is the probability that an individual will
not succeed until the 4th ?
p = 4/5 = 0.8.    p(4) = (1 − p)³ p = 0.2³ × 0.8 = 0.0064.

2. On average, how many trials must one make until succeeding?

Mean = 1/p = 5/4 = 1.25.

3. What’s the probability that the first successful attempt is the 7th or later?

p(7) + p(8) + p(9) + . . . = p(1 − p)⁶ / (1 − (1 − p)) = (1 − p)⁶ = 0.2⁶.
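The three answers can be checked with a few lines of Python (an addition to the notes; geom_pmf is an illustrative helper, not a library function).

```python
# Sketch (assumed Python): Geometric(p) calculations for the log-on example.
p = 0.8   # probability of succeeding on any given attempt

def geom_pmf(x, p):
    """P(first success occurs on attempt x), x = 1, 2, ..."""
    return (1 - p) ** (x - 1) * p

print(geom_pmf(4, p))    # 0.0064
print(1 / p)             # mean number of attempts: 1.25
print((1 - p) ** 6)      # P(first success on the 7th attempt or later) = 0.2^6
```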
Example (contd. from Binomial)
Again suppose that 10 users are authorised to use a particular computer system, and that
the system collapses if 7 or more users attempt to log on simultaneously. Suppose that each
user has the same probability p = 0.2 of wishing to log on in each hour.
Using the Binomial distribution we found the probability that the system will crash in any
given hour to be 0.00086.
Using the Geometric distribution formulae, we are able to answer questions such as: On
average, after how many hours will the system crash?
Mean = 1/p = 1/0.00086 ≈ 1163 hours.
Example: Mad(?) Dictators and Birth Control
A dictator, keen to maximise the ratio of males to females in his country (so he could build
up his all male army) ordered that each couple should keep having children until a boy was
born and then stop.
Calculate the expected number of boys that a couple will have, and the expected number of girls, given that P(boy) = 1/2.
Assume for simplicity that each couple can have arbitrarily many children (although this
is not necessary to get the following results). Then since each couple stops when 1 boy is born,
the expected number of boys per couple is 1.
On the other hand, if Y is the number of girls given birth to by a couple, Y clearly follows
the alternative formulation for the Geometric(½) distribution.
So the expected number of girls for a couple is (1 − 1/2)/(1/2) = 1.
4.2 Poisson Distribution
Poi(λ)
Let X be a random variable on N = {0, 1, 2, . . .} with pmf
p(x) = e^(−λ) λ^x / x!,    x = 0, 1, 2, . . .,
for some λ > 0.
Then X is said to follow a Poisson distribution with rate parameter λ and we write X ∼
Poi(λ).
Poisson random variables are concerned with the number of random events occurring per
unit of time or space, when there is a constant underlying probability ‘rate’ of events occurring
across this unit.
Examples
• the number of minor car crashes per day in the U.K.;
• the number of mistakes on each of my slides;
• the number of potholes in each mile of road;
• the number of jobs which arrive at a database server per hour;
• the number of particles emitted by a radioactive substance in a given time.
An interesting property of the Poisson distribution is that it has equal mean and variance,
namely
µ = λ,    σ² = λ.

The skewness is given by

γ1 = 1/√λ,

so is always positive but decreasing as λ ↑.
[Figure: Example Poi(5) pmf (again).]
Poisson Approximation to the Binomial
Notice the similarity between the pmf plots we’ve seen for Binomial(20,¼) and Poi(5).
It can be shown that for Binomial(n, p), when p is small and n is large, this distribution can
be well approximated by the Poisson distribution with rate parameter np, Poi(np).
The p = ¼ in the example above is not small; we would typically prefer p < 0.1 for the approximation to be useful.
The usefulness of this approximation is in using probability tables; tabulating a single
Poisson(λ) distribution encompasses an infinite number of possible corresponding Binomial
distributions, Binomial(n, λ/n).
Ex.
A manufacturer produces VLSI chips, of which 1% are defective. Find the probability that
in a box of 100 chips none are defective.
We want p(0) from Binomial(100,0.01). Since n is large and p is small, we can approximate
this distribution by Poi(100 × 0.01) ≡ Poi(1).
Then p(0) ≈ e⁻¹ · 1⁰ / 0! = 0.3679.
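For comparison, a small Python sketch (an addition) computes both the exact Binomial(100, 0.01) value of p(0) and its Poisson(1) approximation; the exact value is about 0.3660, so the approximation is accurate to two decimal places.

```python
# Sketch (assumed Python): Poisson approximation to the Binomial for p(0).
from math import comb, exp, factorial

n, p = 100, 0.01
lam = n * p

binom_p0 = comb(n, 0) * p ** 0 * (1 - p) ** n    # exact: 0.99^100 ≈ 0.3660
pois_p0 = exp(-lam) * lam ** 0 / factorial(0)    # approximation: e^-1 ≈ 0.3679

print(round(binom_p0, 4), round(pois_p0, 4))
```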
Fitting a Poisson Distribution to Data
Ex.
The number of particles emitted by a radioactive substance which reached a Geiger counter
was measured for 2608 time intervals, each of length 7.5 seconds.
The (real) data are given in the table below:
x    |  0 |   1 |   2 |   3 |   4 |   5 |   6 |   7 |  8 |  9 | ≥10
nx   | 57 | 203 | 383 | 525 | 532 | 408 | 273 | 139 | 45 | 27 |  16
Do these data correspond to 2608 independent observations of an identical Poisson random
variable?
The total number of particles, ∑x x nx, is 10,094, and the total number of intervals observed, n = ∑x nx, is 2608, so that the average number reaching the counter in an interval is 10094/2608 = 3.870.
Since the mean of Poi(λ) is λ, we can try setting λ = 3.87 and see how well this fits the
data.
For example, considering the case x = 0, for a single experiment interval the probability of observing 0 particles would be p(0) = e^(−3.87) · 3.87⁰ / 0! = 0.02086. So over n = 2608 repetitions, our (Binomial) expectation of the number of 0 counts would be n × p(0) = 54.4.
Similarly for x = 1, 2, . . ., we obtain the following table of expected values from the Poi(3.87)
model:
x      |    0 |     1 |     2 |     3 |     4 |     5 |     6 |     7 |    8 |    9 |  ≥10
O(nx)  |   57 |   203 |   383 |   525 |   532 |   408 |   273 |   139 |   45 |   27 |   16
E(nx)  | 54.4 | 210.5 | 407.4 | 525.5 | 508.4 | 393.5 | 253.8 | 140.3 | 67.9 | 29.2 | 17.1

(O = Observed, E = Expected.)
The two sets of numbers appear sufficiently close to suggest the Poisson approximation is
a good one. Later, when we come to look at hypothesis testing, we will see how to make such
judgements quantitatively.
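The expected counts above can be regenerated with a short Python sketch (an addition to the notes), computing E(nx) = n · p(x) under the fitted Poi(3.87) model and putting the remaining probability mass into the ≥10 category.

```python
# Sketch (assumed Python): expected counts under a fitted Poisson(3.87) model
# for n = 2608 observed intervals.
from math import exp, factorial

lam, n = 3.87, 2608

def pois_pmf(x, lam):
    return exp(-lam) * lam ** x / factorial(x)

expected = {x: n * pois_pmf(x, lam) for x in range(10)}
expected["10+"] = n * (1 - sum(pois_pmf(x, lam) for x in range(10)))

for x, e in expected.items():
    print(x, round(e, 1))   # 0 -> 54.4, 1 -> 210.5, ..., 10+ -> 17.1
```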
4.3 Discrete Uniform Distribution
U({1, 2, . . . , n})
Let X be a random variable on {1, 2, . . . , n} with pmf
p(x) = 1/n,    x = 1, 2, . . . , n.
Then X is said to follow a discrete uniform distribution and we write X ∼ U({1, 2, . . . , n}).
The mean and variance are

µ = (n + 1)/2,    σ² = (n² − 1)/12,

and the skewness is clearly zero.
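As a final check, a few lines of Python (an addition, using n = 6, i.e. a fair die, as an illustrative case) verify these formulae by direct summation over the pmf.

```python
# Sketch (assumed Python): mean and variance of the discrete uniform
# distribution on {1, ..., n}, checked against (n+1)/2 and (n^2 - 1)/12.
n = 6
values = range(1, n + 1)

mu = sum(x / n for x in values)
var = sum(x ** 2 / n for x in values) - mu ** 2

print(mu, (n + 1) / 2)                    # 3.5  3.5
print(round(var, 4), (n ** 2 - 1) / 12)   # 2.9167  2.9166...
```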