PROBABILITY DISTRIBUTIONS
A discrete random variable X takes value x with probability p(x). The probability
distn specifies p(x) for each possible value x. For example, if I flip a fair coin twice,
the number of heads (X) is a random variable with probability distn
Number of heads    0     1     2
Probability        1/4   1/2   1/4
MEAN AND VARIANCE OF A DISTRIBUTION
The expectation, or mean value of X is
E(X) = ∑ x p(x)
Similarly, E(X²) = ∑ x² p(x), etc. The variance of X is
var(X) = E(X – m)²,
where m = E(X). An alternative formula for the variance is E(X²) – m².
For the coin example,
E(X) = 0 × (1/4) + 1 × (1/2) + 2 × (1/4) = 1,
var(X) = (-1)² × (1/4) + 0² × (1/2) + 1² × (1/4) = 1/2
The mean of the distn is 1. This is also the mode of the distn (the most probable
value).
The mean is a measure of location or 'central tendency'. Variance measures the
spread (dispersion) of the distn.
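For illustration, the coin-flip mean and variance can be computed directly in R from the probability table above (a minimal sketch; the variable names are arbitrary):

x <- c(0, 1, 2)            # possible numbers of heads
p <- c(1/4, 1/2, 1/4)      # their probabilities
m <- sum(x * p)            # E(X) = 1
sum((x - m)^2 * p)         # var(X) = 1/2
sum(x^2 * p) - m^2         # alternative formula, also 1/2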
The expectation terminology dates from 18th century games of chance. Consider
the following game: you pay me $1 (the stake). A fair coin is then flipped. If it falls
heads, I return the stake and pay an additional $1. If tails, I keep the stake and
pay nothing. Is this a fair game?
Your return is a random variable, determined by the flip of the coin. The expected
return is 2 x 0.5 + 0 x 0.5 = 1. This is exactly equal to the stake: the game is fair.
BINOMIAL DISTRIBUTION
A coin with probability p of landing heads and q of landing tails is flipped three
times.
Sequence         Probability     Heads   Probability
TTT              q³              0       q³
HTT, THT, TTH    pq² each        1       3pq²
HHT, HTH, THH    p²q each        2       3p²q
HHH              p³              3       p³
Note that q³ + 3pq² + 3p²q + p³ = (q + p)³ = 1.
Now consider what happens when the coin is flipped n times. The probability of
obtaining any particular sequence of x heads and n – x tails is the product of n
numbers, x of which are equal to p, and n – x of which are equal to q. The number
of such sequences is the number of ways of choosing x positions from n for the
heads. Therefore, probability of x heads is
p(x) = (n choose x) p^x q^(n−x),   x = 0, 1, …, n
For the binomial distn, mean and variance are
E(X) = np, var(X) = npq.
Note also that E(X/n) = p, var(X/n) = pq/n.
Example: Two black-coated animals both carry the recessive allele for red coat
colour. The probability that they produce a red calf is 1/4. This is independently true
for each offspring. Among 4 progeny, the number of red calves has a binomial distn
with n = 4 and p = 1/4.
# red calves   0        1        2        3        4
Probability    0.3164   0.4219   0.2109   0.0469   0.0039
The mean number of red calves is
0 x 0.3164 + 1 x 0.4219 + … = 1, or more simply 4 x 0.25 = 1.
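These probabilities, and the mean, can be reproduced with R's dbinom (the same call appears in the examples at the end of these notes):

prob <- dbinom(0:4, size = 4, prob = 1/4)   # 0.3164 0.4219 0.2109 0.0469 0.0039
sum(0:4 * prob)                             # mean = 1 (= np)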
A discrete distribution is usually displayed as a bar chart, with the area of the bar
representing probability. The example below shows the binomial distribution with
n = 16 and p = 1/4. This is a skew distn: the right tail is longer than the left tail.
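A plot of this kind can be produced in R with barplot (an illustrative sketch; the axis labels are arbitrary choices):

x <- 0:16
barplot(dbinom(x, size = 16, prob = 1/4), names.arg = x,
        xlab = "Number of heads", ylab = "Probability")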
MULTINOMIAL DISTRIBUTION
The binomial distn arises when there are n trials, and each trial has two possible
outcomes. (A trial is an opportunity for the event to happen.) The multinomial distn
arises when each trial has more than two possible outcomes. For example, a six-sided “die” is thrown sixty times, with the following results:
Score       1   2   3    4    5    6    Total
Frequency   8   7   12   10   11   12   60
Is the die fair? Or is there evidence of bias in favour of certain scores? You will
learn how to answer such questions later in the course.
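For illustration only (this is not the test of fairness referred to above), R's dmultinom gives the multinomial probability of any particular set of counts; here, the probability of exactly the observed frequencies if the die were fair:

counts <- c(8, 7, 12, 10, 11, 12)
dmultinom(counts, size = 60, prob = rep(1/6, 6))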
HYPERGEOMETRIC DISTRIBUTION
A box contains r red balls and b black balls (r + b = N). A random sample of n balls
is removed from the box, without replacement. The probability that there are x red
balls and y black balls in the sample (x + y = n) is
(r choose x)(b choose y) / (N choose n)
The denominator is the number of possible samples. The two numbers in the
numerator are (i) the number of ways the x red balls can be chosen, (ii) the number
of ways that the y black balls can be chosen.
Example: a lottery.
A box holds balls numbered 1 to 59. You choose six of these (red). Think of the
remaining 53 balls as black. A random sample of six balls is removed from the box.
What is the probability that x of the chosen balls are in the sample, for x = 0 to 6?
# of chosen balls in the sample   0        1        2        3        more than 3
Probability                       0.5095   0.3821   0.0975   0.0104   0.0005
The chance of “hitting the jackpot” (all 6 chosen balls in the sample) is about 1 in
45 million.
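These lottery probabilities can be checked with R's dhyper, where m is the number of 'red' (chosen) balls, n the number of 'black' balls, and k the sample size:

dhyper(0:6, m = 6, n = 53, k = 6)   # probabilities for x = 0, 1, ..., 6
1 / choose(59, 6)                   # jackpot probability, about 1 in 45 million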
Sampling with and without replacement
If we sample without replacement, the number of red balls in the sample has a
hypergeometric distn. Sampling with replacement, the distn is binomial with index n
and parameter p = r/N.
For the hypergeometric distn,
E(X) = np,
var(X) = npq(N – n)/(N – 1)
where p = r/N and q = b/N. The mean of the hypergeometric distn is the same as
the mean of the binomial distn we would obtain if sampling with replacement, but
the variance is smaller by a factor (N – n)/(N – 1).
In general, sampling without replacement generates more complicated probability
distns than sampling with replacement, but when the size of the sample (n) is small
relative to the size of the population (N) the two distns are not very different. When
sampling without replacement, we sometimes use the simpler results appropriate to
sampling with replacement as an approximation.
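A quick numerical comparison in R illustrates this; the numbers here (N = 1000, r = 250, n = 10) are chosen purely for illustration:

N <- 1000; r <- 250; n <- 10
rbind(hypergeometric = dhyper(0:n, r, N - r, n),
      binomial       = dbinom(0:n, n, r / N))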
POISSON DISTRIBUTION
The Poisson distn arises as the limit of the binomial distn as n tends to infinity and p
tends to zero while the product np = m remains constant (distn of rare events).
At the limit, the binomial probability becomes
p(x) = m^x exp(−m) / x!
Possible values are x = 0, 1, 2, … (to infinity, theoretically).
This is the Poisson distn with parameter m. The mean and variance of the distn are
both equal to m.
Exponential function
exp(x) is the limit as n tends to infinity of (1 + x/n)^n.
In this form we can see the connection between binomial and Poisson distributions.
The alternative definition as the sum of the infinite series
1 + x + x²/2! + x³/3! + ⋯
shows that Poisson probabilities sum to 1.
Example: birthdays
Each member of a group of 365 people is asked whether today is his or her
birthday. Let Y equal the number of people who say “yes”. The probability that Y
equals zero is (1 – 1/365) raised to the power 365, approximately exp( – 1) = 0.368.
More generally we can show that (approximately)
Pr(Y = y) = e^(−1) / y!
(Y has a Poisson distn with mean 1.)
# birthdays    0       1       2       3       4 (or more)
Probability    0.368   0.368   0.184   0.061   0.019
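In R (the same calls appear in the examples at the end of these notes):

dpois(0:3, lambda = 1)     # 0.368 0.368 0.184 0.061
1 - ppois(3, lambda = 1)   # Pr(4 or more) = 0.019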
THE POISSON PROCESS
Events occur singly, at random, at times or locations T1,T2, … , such that the
numbers of events in non-overlapping intervals are independent random variables.
Probability that n events occur in the time interval (0,t) is
(λt)^n exp(−λt) / n!
The number of events has a Poisson distn with mean m =λt. The “rate” parameter λ
measures the average rate at which events occur.
Example: vehicle arrivals
Assume that vehicles arrive at a census point in a Poisson process. If, on average,
one vehicle passes the fixed point every 20 seconds, what is the probability that
there will be at most 2 arrivals in a two-minute period?
Average arrival rate (λ) is 3 per minute. Number of events X in a period of length
2 minutes has a Poisson distn with mean 6. Then
Pr(X ≤ 2) = exp(−6) [1 + 6 + 18] = 0.062
(Or use Cambridge Tables, No. 2).
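Equivalently, in R:

ppois(2, lambda = 6)   # Pr(at most 2 arrivals) = 0.062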
Poisson distribution: examples
As the distn of rare events:
1. Number in a large group with birthday on a particular day;
2. Number of misprints per page of a book;
3. Annual number of deaths from horse-kicks in the Prussian cavalry
Arising from the Poisson process:
1. Radioactive disintegrations over time;
2. Number of bacteria per unit area of a petri dish;
3. Recombination events in a length of chromosome;
4. Vehicles passing a fixed point.
CONTINUOUS DISTRIBUTIONS
The r.v.s we have looked at so far have taken integer values 0,1,2, etc. They have
arisen as a result of counting: how many times did a particular event occur? We call
such r.v.s “discrete”. A “continuous” r.v. has values that are not restricted to a
discrete set: if x1 and x2 are two possible values of the r.v., any value between x1
and x2 is also possible.
EXPONENTIAL DISTRIBUTION
A simple example of a continuous r.v. is the time T to the first event in a Poisson
process of rate λ. Let Nt denote the number of events between time 0 and time t.
This has a Poisson distn with mean λt, so
Pr(T > t) = Pr(Nt = 0) = exp(−λt).
The distn of T is called the exponential distn. It can be used to describe the lengths
of chromosome segments between recombination events, the times between random
events (e.g. vehicles passing a road census point), or the lifetimes of components
subject to failure at a constant rate. The mean of the distn is 1/λ.
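R's pexp works with this rate parameterization; a minimal sketch using the vehicle example's rate of 3 per minute:

lambda <- 3
1 - pexp(0.5, rate = lambda)   # Pr(T > 0.5 minutes) = exp(-1.5)
1 / lambda                     # mean time to first event, in minutes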
The cumulative distribution function of a continuous r.v. X is defined in exactly the
same way as for a discrete r.v., but the discrete probability function is replaced by
a probability density function f(x), such that the probability that X lies in a short
interval (x, x + h) is approximately hf(x) for small h.
The probability that a<X<b is given by the integral
Pr(a < X < b) = ∫ f(x) dx   (integral from a to b)
With a continuous distn, we can only attach a probability to an interval of values,
not to a single value.
The two functions F(x) (cumulative probability function) and f(x) (probability density
function) are related: F(x) is the area under the curve represented by the function f
to the left of x. Thus the probability that a < X < b, F(b) – F(a), is the area under the
probability density function to the left of b and to the right of a.
When looking at a probability density function, it is the area under the curve that is
important, not the value of the function.
THE NORMAL DISTRIBUTION
The normal, or Gaussian, density function is
f(x) = (1/√(2π)) exp(−x²/2)
Plotting this curve shows a characteristic bell shape, symmetric about zero. A
random variable with this density function is said to be normally distributed, or
Gaussian.
The mean of the distn is 0 and the variance is 1. This is the “standard” form of the
normal distn.
The notation N(m,σ²) denotes a normal distn with mean m and variance σ². If the
distn of X is N(m,σ²), the standardized value (X – m)/σ has a standard normal
distn.
KURTOSIS
Kurtosis measures the thickness of the tails of a distribution. The standard for
comparison is the normal distribution: positive kurtosis indicates a distribution with
thicker tails than the normal distribution.
CENTRAL LIMIT THEOREM
The distn of the average of a sample of size n from almost any distn tends to a
normal distn as n tends to infinity. This result can explain why a trait is normally
distributed (it may be determined by many genes, each of which individually has a
small effect). For example, the distns of heights and weights in a population often
follow a normal distn (at least approximately).
Example: let X be binomial (n, p). The mean and variance of X are np and npq,
where q = 1 – p. For large n, the probability that X lies between r and s (inclusive) is
approximated by the probability that a standard normal r.v. lies between limits
(r − np − 0.5)/√(npq) and (s − np + 0.5)/√(npq)
Example: if X is binomial with n = 16, p = 1/2, what is Pr(6 ≤ X ≤ 10)?
This is approximated by the probability (0.79) that a standard normal lies between
the limits ±1.25. Exact answer (0.79 to two decimal places) is the sum of five
binomial probabilities.
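Both calculations are easy to check in R:

sum(dbinom(6:10, size = 16, prob = 1/2))   # exact probability, about 0.79
pnorm(1.25) - pnorm(-1.25)                 # normal approximation, about 0.79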
CUMULATIVE PROBABILITIES AND QUANTILES
We discuss this first for continuous distns (e.g., exponential, normal). The
cumulative distn function for random variable X is F(x) = Pr(X ≤ x). This is an
increasing function of x with a value between 0 and 1. The corresponding quantile
function Q(p) is the value of x such that F(x) = p, so that Q(p) = x and F(x) = p are
equivalent statements.
The first, second and third quartiles of the distn are Q(0.25), Q(0.5), and Q(0.75).
The second quartile Q(0.5) is also known as the median. Total probability is split by
the median into two equal parts. The three quartiles split the probability into 4 equal
parts. Similarly, four quintiles split into 5 equal parts, nine deciles into 10 equal
parts, 99 percentiles into 100 equal parts. The generic term for medians, quartiles,
etc is 'quantile'. Note that there may be several names for the same quantile. For
example, Q(0.75) could be referred to as third quartile, upper quartile, 75th
percentile, or the 75% point of the distribution. Or it could be called the upper 25%
point, 'upper' indicating that the measured probability is in the upper rather than
the lower tail of the distn.
For some distns, the calculation of F(x) or Q(p) is straightforward. For the unit
exponential distn, F(x) = 1 − exp(−x) and Q(p) = −loge(1 − p). For most other
distns, the calculation of F(x) requires laborious summation or numerical
integration. E.g., for the standard normal,
F(x) = ∫ (1/√(2π)) exp(−u²/2) du   (integral from −∞ to x)
Hence the need for statistical tables.
TABLES OF THE NORMAL DISTRIBUTION
Probabilities for the normal and other continuous distns are calculated as areas
under the density function. Statistical tables give cumulative probabilities for the
standard normal distn with zero mean and unit variance.
If necessary, standardize X before referring to tables. For example, if X is N(m,σ²),
what is Pr(m – σ < X < m + σ)?
This is the same as the probability that –1 < (X – m)/σ < +1, the probability that a
standard normal r.v. deviates from its mean value by less than one standard
deviation. Tables of the normal distn give F(1) = 0.8413, so required probability is
F(1) – F(–1) = 0.8413 – (1 – 0.8413) = 0.6826 (draw a picture).
QUANTILES OF THE NORMAL DISTRIBUTION
Here are some frequently used percentiles of the standard normal distn:
Percent      90       95       97.5     99
Percentile   1.2816   1.6449   1.9600   2.3263
If X is N(m,σ²), the quantile is m + xσ, where x is the corresponding quantile of the
standard normal. For example, a r.v. X is normally distributed with mean 100 and
standard deviation 10. What is the upper 2.5% point of its distn? The upper 2.5%
point of the standard normal distn is 1.96 (the probability that a standard normal
exceeds 1.96 is 0.025). The upper 2.5% point of the distn of X is
100 + 10×1.96 = 119.6
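In R this can be done either by standardizing or by giving qnorm the mean and standard deviation directly:

100 + 10 * qnorm(0.975)             # 119.6
qnorm(0.975, mean = 100, sd = 10)   # same answer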
TABLES FOR DISCRETE DISTRIBUTIONS
Tables are also required for discrete distns (binomial, hypergeometric, Poisson, etc),
because F(x) may be the sum of a very large number of probabilities. We have to be
slightly more careful when dealing with a discrete r.v. For a purely continuous r.v.,
Pr (X ≤x) and Pr(X < x) are equal, but this is not necessarily the case for a
discrete r.v. For example, suppose X is discrete, taking values 0, 1, 2, etc. If x is
one of the possible values of X, the two probabilities differ by Pr(X = x).
The definition of the median (and other quantiles) is trickier for a discrete distn.
E.g., suppose X takes values 1...4 with equal probability. The cumulative probability
function is a step function with value 1/4 between 1 and 2, 1/2 between 2 and 3,
3/4 between 3 and 4. The median could be defined to be any value between 2 and 3
(e.g. 2.5).
R FUNCTIONS pnorm( ) and qnorm( )
In R, pnorm(x) calculates F(x) and qnorm(p) calculates Q(p) for the standard normal
distn. dnorm(x) gives the probability density function (useful for plotting the normal
curve, but not usually required for probability calculations). There are similar
functions for other distns (dbinom, pbinom, etc).
Examples
1. Number of red calves among four progeny? dbinom(0:4, 4, 1/4)
2. How many birthdays in a group of 365? dpois(0:3, lambda = 1)
3. Vehicle arrivals? ppois(2, lambda = 6)
4. Probability that standard normal deviates from its mean by less than 1 s.d.?
pnorm(1) - pnorm(-1), or 2*pnorm(1) - 1
5. Upper 2.5% point of the standard normal?
qnorm(0.975) or qnorm(0.025, lower.tail = FALSE)
BIVARIATE DISTRIBUTIONS
If two random variables Y1 and Y2 are independently distributed, the conditional
distn of Y1 does not depend on the value of Y2. An example of two variables which
are not independently distributed are height and weight: the distn of weights among
a sub-population of short people differs from the distn among a sub-population of
tall people. There is a positive covariance (or correlation) between height and
weight.
COVARIANCE
The covariance between r.v.s Y1 and Y2 is
cov(Y1,Y2) = E[(Y1 – m1)(Y2 – m2 )] = E(Y1Y2) – m1m2
The correlation between Y1 and Y2 is a scaled version of the covariance which
removes dependence on units of measurement:
cor(Y1, Y2) = cov(Y1, Y2)/(σ1 σ2)
where σ1 and σ2 are the standard deviations of Y1 and Y2.
Covariance between r.v.s has an effect on the variance of the sum:
var(Y1 + Y2) = var(Y1) + var(Y2) + 2 cov(Y1,Y2).
More generally, the variance of a sum of r.v.s is the sum of their variances plus
twice the sum of the pairwise covariances.
The variance of the sum of independent r.v.s is the sum of their variances.
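A small simulation in R illustrates these identities (the simulated variables here are purely for illustration):

set.seed(1)
y1 <- rnorm(10000)
y2 <- 0.5 * y1 + rnorm(10000)   # y2 depends on y1, so the covariance is positive
cov(y1, y2)
cor(y1, y2)                     # = cov(y1, y2) / (sd(y1) * sd(y2))
var(y1 + y2)                    # approximately var(y1) + var(y2) + 2*cov(y1, y2)
var(y1) + var(y2) + 2 * cov(y1, y2)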
There are bivariate versions of many distns, but the bivariate normal distn is the
only example we deal with on this course.
PROBABILITY (DENSITY) FUNCTION
If Y1 and Y2 are discrete r.v.s, the probability function of their joint distn is
f(y1, y2) = Pr(Y1 = y1, Y2 = y2)
for all possible pairs of values (y1, y2). If the distn is continuous, the probability
density function f(y1, y2) is such that the probability that the random point (Y1,Y2)
falls in a small rectangle with corners at (y1, y2) and (y1 + h, y2 + k) is approximately
hk f(y1, y2).
BIVARIATE NORMAL DISTRIBUTION
The probability density function f(y1, y2) of the standard bivariate normal distn is
proportional to
exp[ −(y1² + y2² − 2ρ y1 y2) / (2(1 − ρ²)) ]
The standard distn has zero means and unit variances. The general form of the
distn allows arbitrary mean values and variances. The random vector (Y1,Y2) then
has mean vector (m1, m2) and covariance matrix
( σ1²        ρ σ1 σ2 )
( ρ σ1 σ2    σ2²     )
The figure below shows the p.d.f. of a bivariate normal distn with equal variances
and a correlation of ρ = 0.5 between the two variables.
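One way to explore this distn in R is to simulate from it with mvrnorm from the MASS package (a sketch assuming, as in the figure, unit variances and ρ = 0.5):

library(MASS)
Sigma <- matrix(c(1, 0.5, 0.5, 1), nrow = 2)   # covariance matrix
y <- mvrnorm(n = 5000, mu = c(0, 0), Sigma = Sigma)
cor(y[, 1], y[, 2])                            # close to 0.5
plot(y, xlab = "y1", ylab = "y2")              # elliptical cloud of points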
MARGINAL AND CONDITIONAL DISTRIBUTIONS
Generating a pair of values from a bivariate distn can be done in two stages.
1. Select Y1 from its marginal distn.
2. Select Y2 from the conditional distn of Y2, given Y1.
For the standard bivariate normal distn,
Marginal distn of Y1 is N(0,1),
Conditional distn of Y2, given Y1, is normal with mean and variance
E(Y2|Y1) = ρY1, var(Y2|Y1) = 1 − ρ²
The figure below shows the same example of a bivariate normal distn using
probability contours. Also shown are the two regression (or prediction) lines,
representing E(Y2|Y1) and E(Y1|Y2). At zero correlation, these lines are at right
angles, and with perfect correlation, the two lines merge into one.
MULTIVARIATE NORMAL DISTRIBUTION
The bivariate normal distn generalizes to the multivariate normal distn for three or
more variables. Covariances between pairs of variables are shown as a covariance
matrix with variances on the diagonal and covariances off-diagonal.
SUMS OF SQUARES, DEGREES OF FREEDOM
The sum of squared deviations ∑ (Yi − Ȳ)² is sometimes called the corrected sum of
squares and written as Syy. For sample size n = 2, it can be written in two different
ways as
(Y1 − Ȳ)² + (Y2 − Ȳ)² = (1/2) (Y1 − Y2)²
In this case (when n = 2), Syy appears to be the sum of two squares, but can be
reduced to a single square by algebraic manipulation. When n = 2, we say Syy has 1
'degree of freedom'. In the same way, the corrected sum of squares for a sample of
size n can be written as the sum of n – 1 squares, and has n – 1 degrees of
freedom.
It can be shown that Syy has expectation (n – 1)σ². An unbiased estimate of σ² is
obtained by dividing Syy by ( n – 1). This is why the formula for the sample variance
has n – 1 on the denominator rather than n.
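R's var() already uses the n − 1 divisor; a small simulation (purely illustrative) shows that it estimates σ² without bias:

set.seed(2)
sigma2 <- 4
s2 <- replicate(10000, var(rnorm(5, mean = 0, sd = sqrt(sigma2))))
mean(s2)   # close to sigma2 = 4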
SAMPLING DISTRIBUTIONS
The chi-squared, t, and F distns arise when sampling from the normal distn.
Assume that Y 1 … Y n is a random sample from N(m,σ²). Define sample mean Ȳ
and sample variance S² as
Ȳ = (1/n) ∑ Yi ,   S² = 1/(n − 1) ∑ (Yi − Ȳ)²   (sums over i = 1, …, n)
a) The distn of the sample mean Ȳ is N(m,σ²/n), and the standardized value of the
sample mean
√n (Ȳ − m)/σ
has a standard normal distn.
b) The distribution of (n – 1)S²/σ² is chi-squared with n – 1 degrees of freedom.
This is the distn of the sum of squares of n – 1 independent standard normal
variables.
c) The distn of √n (Ȳ − m)/S is called Student's t distn with n – 1 degrees of
freedom. The shape of this distn depends on n – 1, the number of degrees of
freedom associated with S². When this number is large, the t distn is close to
standard normal. When the number is small, the t distn has thicker tails (shows
positive kurtosis). The quantiles of the t distribution are always further from zero
than the corresponding quantiles of the standard normal distribution.
d) When S1² and S2² are independent estimates of σ², with degrees of freedom n1
and n2, the distn of S1²/S2² is called the F distn with n1 and n2 degrees of freedom.
For example, the sample variance S² is an estimate of σ², with n – 1 degrees of
freedom, and so also is n(Ȳ − m)², with 1 degree of freedom. The ratio
n(Ȳ − m)²/S² has an F distn with 1 and n – 1 d.f.
There are various relationships among the three distns. For example, when T has a
t distn with ν d.f., T2 has an F distn with 1 and ν d.f.
R functions pchisq( ), pt( ), pf( ), etc are available for calculating cumulative
probabilities and qchisq( ), etc, for quantiles. Often these are not needed because
the calculation is done internally by a function such as lm( ) or t.test( ).
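For example, comparing upper 2.5% points of t distns with the standard normal shows the thicker tails of the t:

qt(0.975, df = c(2, 5, 30, 100))   # decreases towards the normal value as df grows
qnorm(0.975)                       # 1.96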
GENERALISATION
The corrected sum of squares can be regarded as a 'residual' sum of squares,
after adjusting each observation by subtracting the sample mean. In simple linear
regression, a residual sum of squares for Y is corrected for the mean and also for
regression on an explanatory variable X. In this case, the number of degrees of
freedom associated with the residual sum of squares is n – 2, where n is sample
size and 2 is the number of terms fitted (intercept, and X). In multiple regression
and anova, the residual sum of squares has n – k degrees of freedom, where k is
the total number of terms fitted. (For example, in one-way anova, k is the number of
groups.) In general, the residual sum of squares has expectation (n – k)σ², and an
unbiased estimate of σ² is obtained by dividing the residual sum of squares by (n – k). In
the context of multiple regression or anova, this estimate is called the residual mean
square.
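In R, the residual degrees of freedom and residual mean square can be read from a fitted lm object; a sketch with simulated data (the variable names are arbitrary):

set.seed(3)
x <- runif(20)
y <- 2 + 3 * x + rnorm(20, sd = 0.5)
fit <- lm(y ~ x)
df.residual(fit)                       # n - 2 = 18 (intercept and x fitted)
sum(resid(fit)^2) / df.residual(fit)   # residual mean square, estimates sigma^2
summary(fit)$sigma^2                   # the same quantity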