STATISTICAL LABORATORY, April 23rd, 2010
UNIVARIATE PROBABILITY DISTRIBUTIONS
Mario Romanazzi
1 POISSON DISTRIBUTION
Ex1 Suppose that in a city, the number of suicides can be approximated by a Poisson
process with rate λ = .33 per month. 1) Find the probability of k suicides in a year
for k = 0, 1, .... What is the most probable number of suicides? 2) What is the
probability of two suicides in a week? 3) A suicide is reported in today’s newspaper.
What is the probability that the waiting time for the next suicide is greater than 1
month? 4 months? What is the median of the waiting time? (Rice, 2.32)
1. The number X of suicides in a year has a Poisson distribution, with parameter
12λ = 3.96 and probability function
$$P(X = k) = \exp(-3.96)\,\frac{3.96^k}{k!}, \quad k = 0, 1, \ldots$$
We use R to derive the first few values of the probability function.
> tab <- data.frame(0:10, round(dpois(0:10, lambda = 12 * 0.33), 3),
+     round(ppois(0:10, lambda = 12 * 0.33), 3))
> names(tab) = c("Values", "Probs", "Cum. Probs")
> tab
   Values Probs Cum. Probs
1       0 0.019      0.019
2       1 0.075      0.095
3       2 0.149      0.244
4       3 0.197      0.441
5       4 0.195      0.637
6       5 0.155      0.791
7       6 0.102      0.893
8       7 0.058      0.951
9       8 0.029      0.980
10      9 0.013      0.992
11     10 0.005      0.997
From the previous table, the mode is 3 (meaning that the most probable number
of suicides in a year is 3) and the median is 4.
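2. Taking a week as one quarter of a month, the number of suicides in a week is
Poisson with parameter λ/4 = .33/4 = 0.0825, so the required probability is
$$\exp(-0.0825)\,\frac{0.0825^2}{2!} \approx 0.00313,$$
a very low value.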
3. The waiting time T (months) between two suicides has an exponential distribution
with rate parameter λ = 0.33, meaning that the average time between
two suicides is $\lambda^{-1} \approx 3.030$ months. We use R to answer the questions.
> 1 - pexp(1, rate = 0.33)
[1] 0.7189237
> 1 - pexp(4, rate = 0.33)
[1] 0.2671353
> qexp(0.5, rate = 0.33)
[1] 2.100446
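The three answers above can be cross-checked in a few lines of R. This is only a sketch: the variable names (lambda, probs, p.week and so on) are ours, not part of the original solution, and part 2 keeps the approximation of one week as a quarter of a month.

```r
lambda <- 0.33                      # rate per month, from the exercise

# Part 1: mode and median of the yearly count, X ~ Poisson(12 * lambda)
probs <- dpois(0:10, lambda = 12 * lambda)
mode.year <- (0:10)[which.max(probs)]          # most probable value among 0..10
median.year <- qpois(0.5, lambda = 12 * lambda)

# Part 2: two suicides in a week, assuming 1 week = 1/4 month
p.week <- dpois(2, lambda = lambda / 4)

# Part 3: exponential waiting time, written in closed form
surv1 <- exp(-lambda * 1)           # P(T > 1), same as 1 - pexp(1, rate = lambda)
surv4 <- exp(-lambda * 4)           # P(T > 4)
median.wait <- log(2) / lambda      # solves 1 - exp(-lambda * t) = 0.5

print(c(mode = mode.year, median = median.year))
print(round(c(p.week = p.week, surv1 = surv1, surv4 = surv4,
              median.wait = median.wait), 5))
```

The closed forms in part 3 follow from the exponential cdf F(t) = 1 − exp(−λt), which is what pexp and qexp evaluate.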
Ex2 Find the probability density for the distance of an event to its nearest neighbour for
a Poisson process in the plane (Rice, 2.42)
Solution. Assume that events are observed in the Cartesian plane according to a
Poisson process with parameter λ. This means that the number of events observed
in any region of area a follows a Poisson law with parameter λa; in particular, the
count in any unit square is Poisson with parameter λ. We write P for the reference
point (event) in the plane. The nearest neighbour Q is the observed point (event)
with the lowest distance to P. Let D denote the (random) distance
from P to Q and let d > 0 be a fixed distance. Moreover, we denote by
C(P, d) the circle centered at P with radius d. Observe that
$$A : D > d \iff B : \text{no points (events) are observed inside } C(P, d),$$
hence $P(A) = 1 - F_D(d) = P(B)$, and P(B) can be derived from the Poisson
distribution. Since the area of C(P, d) is $\pi d^2$, the number of events inside it
is Poisson with parameter $\lambda \pi d^2$, so
$$P(B) = \exp(-\lambda \pi d^2) = 1 - F_D(d),$$
and
$$F_D(d) = 1 - \exp(-\lambda \pi d^2).$$
The pdf is the derivative of the previous function with respect to d,
$$f_D(d) = 2\pi\lambda d \exp(-\lambda \pi d^2), \quad d \ge 0,$$
and $f_D(d) = 0$, identically, when d < 0.
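The result can be checked by simulation. The sketch below is ours and rests on several assumptions not in the original: an arbitrary intensity lambda = 2, a square window of half-side 3 centred at the reference point P = (0, 0), and 5000 replications. The window is large enough that edge effects should be negligible for this intensity.

```r
set.seed(1)
lambda <- 2          # intensity (events per unit area), arbitrary choice
half <- 3            # half-side of the square window centred at P = (0, 0)
nsim <- 5000         # number of simulated realisations

nearest <- replicate(nsim, {
  n <- rpois(1, lambda * (2 * half)^2)     # number of events in the window
  x <- runif(n, -half, half)               # given n, events are uniform
  y <- runif(n, -half, half)
  min(sqrt(x^2 + y^2))                     # distance from P to nearest event
})

FD <- function(d) 1 - exp(-lambda * pi * d^2)                     # cdf derived above
fD <- function(d) 2 * pi * lambda * d * exp(-lambda * pi * d^2)   # pdf derived above

d0 <- c(0.2, 0.4, 0.6)
emp <- sapply(d0, function(d) mean(nearest <= d))
print(rbind(empirical = emp, theoretical = FD(d0)))
area <- integrate(fD, 0, Inf)$value        # total area under the pdf
print(area)
```

The empirical and theoretical rows should agree up to Monte Carlo error, and the area under the pdf should be close to 1.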
2 GENERAL CONTINUOUS DISTRIBUTIONS
Ex1 A line segment of length 1 is cut once at random. What is the probability that the
longer piece is more than twice the length of the shorter piece? (Rice, 2.37)
Solution. Choosing the cut point at random on a segment of length 1 amounts to
sampling one value X from the uniform distribution on the [0, 1] interval. The required
event is the union of the disjoint events A : X < 1/3 and B : X > 2/3 (see figure
below). Since P(A) = P(B) = 1/3, the probability is 2/3.
[Figure: PDF of the uniform distribution on [0, 1] against the cut point x, with the events A and B marked.]
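A short Monte Carlo check of this answer, as a sketch: the sample size of 1e5 and the variable names are our own choices.

```r
set.seed(1)
x <- runif(1e5)                             # cut points, uniform on [0, 1]
shorter <- pmin(x, 1 - x)                   # length of the shorter piece
longer <- pmax(x, 1 - x)                    # length of the longer piece
p.sim <- mean(longer > 2 * shorter)         # Monte Carlo estimate
p.exact <- punif(1/3) + (1 - punif(2/3))    # P(A) + P(B)
print(c(simulated = p.sim, exact = p.exact))
```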
Ex2 If U is a uniform random variable on [0, 1], what is the distribution of the random
variable X = [nU ], where [t] denotes the greatest integer less than or equal to t?
(Rice, 2.36).
Solution. For n = 1, 2, ..., X = [nU ] is the discrete random variable assuming the n
values 0, 1, ..., n − 1 with equal probabilities 1/n. An example, for n = 3, is shown
in the table.
X             0     1     2
Probability  1/3   1/3   1/3
The proof is easy because
$$X = i - 1 \iff \frac{i-1}{n} \le U < \frac{i}{n}, \quad i = 1, 2, \ldots, n, \qquad X = n \iff U = 1.$$
The last event has zero probability and the others have the same probability 1/n.
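The discrete uniform law of X = [nU] can be illustrated by simulation. This is a sketch for n = 3, matching the table above; the sample size is an arbitrary choice of ours.

```r
set.seed(1)
n <- 3
u <- runif(1e5)                       # U uniform on [0, 1]
x <- floor(n * u)                     # X = [nU]
freq <- table(factor(x, levels = 0:(n - 1))) / length(x)
print(freq)                           # relative frequencies of 0, 1, ..., n - 1
```

Each relative frequency should be close to 1/n.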
Ex3 If f and g are densities, show that αf + (1 − α)g is a density, where 0 ≤ α ≤ 1
(Rice, 2.38).
Solution. Put h = αf + (1 − α)g. For all x ∈ R,
h(x) = αf (x) + (1 − α)g(x) ≥ 0
because f (x) ≥ 0, g(x) ≥ 0 and 0 ≤ α ≤ 1. Moreover, the linearity property of
integration implies that the integral of h is equal to 1:
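$$\begin{aligned}
\int_{-\infty}^{\infty} h(x)\,dx &= \int_{-\infty}^{\infty} (\alpha f(x) + (1 - \alpha) g(x))\,dx \\
&= \alpha \int_{-\infty}^{\infty} f(x)\,dx + (1 - \alpha) \int_{-\infty}^{\infty} g(x)\,dx \\
&= \alpha \cdot 1 + (1 - \alpha) \cdot 1 = 1.
\end{aligned}$$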
Hence h is a pdf for all values of 0 ≤ α ≤ 1.
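A numerical illustration of the mixture property, as a sketch only: the two component densities (standard normal and exponential with rate 2) and the weight alpha = 0.3 are arbitrary choices of ours, not taken from the exercise.

```r
alpha <- 0.3
f <- function(x) dnorm(x, mean = 0, sd = 1)     # first density
g <- function(x) dexp(x, rate = 2)              # second density, zero for x < 0
h <- function(x) alpha * f(x) + (1 - alpha) * g(x)

# g jumps at 0, so the two half-lines are integrated separately
total <- integrate(h, -Inf, 0)$value + integrate(h, 0, Inf)$value
xs <- seq(-5, 5, by = 0.01)
nonneg <- all(h(xs) >= 0)                       # h is non-negative on the grid
print(c(total = total, nonneg = nonneg))
```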
Ex4 Suppose that X has the density function $f(x) = cx^2$ for 0 ≤ x ≤ 1 and f(x) = 0
otherwise. 1) Find c. 2) Find the cdf. 3) What is P(.1 ≤ X < .5)? 4) What is the
shortest interval containing 20% of total probability? (Rice, 2.40)
Solution.
1. The numerical value of c is obtained by using the normalization property, that
is, the total area under a density curve is equal to 1.
$$\int_{-\infty}^{\infty} f(x)\,dx = c \int_0^1 x^2\,dx = c \left[\frac{x^3}{3}\right]_0^1 = \frac{c}{3} = 1 \iff c = 3.$$
2. Recall that the cdf is the area allocated on the closed halfline (−∞, x], for any
real number x. If 0 ≤ x ≤ 1,
$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt = 3 \int_0^x t^2\,dt = 3 \left[\frac{t^3}{3}\right]_0^x = x^3.$$
Obviously, if x < 0, F (x) = 0 and, if x > 1, F (x) = 1. The figure below shows
the plots of the pdf and the cdf.
3. Using the property P(a < X ≤ b) = F(b) − F(a), the required probability is
$$P(.1 \le X < .5) = F(.5) - F(.1) + P(X = .1) - P(X = .5) = F(.5) - F(.1) = .125 - .001 = .124,$$
because $P(X = x_0) = 0$ for all real numbers $x_0$ in the case of a continuous
distribution.
4. Since the density function is increasing, the shortest interval including a
proportion p of total probability is $[x_{1-p}, 1]$, where $x_{1-p}$ is the (1 − p)-th order
quantile of the distribution. With p = 0.2,
$$F(x_{0.8}) = x_{0.8}^3 = 0.8 \iff x_{0.8} = 0.8^{1/3} \approx 0.928,$$
so the interval is approximately [0.928, 1].
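The four answers can be verified numerically. The sketch below is ours: the names c.const, Fcdf and len are hypothetical helpers, and part 4 is checked by a grid search over the left endpoint a of every interval [a, b] that carries probability 0.2.

```r
c.const <- 1 / integrate(function(x) x^2, 0, 1)$value   # normalising constant
Fcdf <- function(x) pmin(pmax(x, 0), 1)^3               # cdf: x^3 on [0, 1]
p3 <- Fcdf(0.5) - Fcdf(0.1)                             # P(0.1 <= X < 0.5)
q80 <- 0.8^(1/3)                                        # solves F(x) = 0.8

# Length of [a, b] with F(b) - F(a) = 0.2, i.e. b = (a^3 + 0.2)^(1/3)
len <- function(a) (a^3 + 0.2)^(1/3) - a
a.grid <- seq(0, q80, length.out = 1001)                # admissible left endpoints
a.best <- a.grid[which.min(len(a.grid))]                # left endpoint of shortest interval
print(c(c = c.const, p3 = p3, q80 = q80, a.best = a.best, length = 1 - q80))
```

The grid search should select the largest admissible left endpoint, in agreement with the argument that an increasing density puts the shortest interval at the right end of the support.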
[Figure: left panel, PDF f(x)=3x^2 against x; right panel, CDF F(x)=x^3 against x.]
3 NORMAL DISTRIBUTION
Ex1 Suppose that in a certain population, individuals’ heights (inches) are approximately
normally distributed with parameters µ = 70 and σ = 3. 1) What proportion of
the population is over 6 feet tall? 2) What is the distribution of heights if they are
expressed in centimeters? In meters? 3) What is the shortest interval containing
90% of the population? (Rice, 1.52)
Solution. The problem involves changing the unit of measurement of the reference
variable. For example, let X denote height in inches and let Y denote height in feet.
Since 1 foot is equal to 12 inches, Y = X/12 or X = 12Y. Since 1 inch is equal to
2.54 centimeters, height measured in centimeters (Z) satisfies Z = 2.54X. All these
transformations are scale transformations, a particular instance of the general linear
transformation. This implies (recall that the normal family is closed under all linear
transformations) that the transformed variable is again normally distributed, with
different parameters.
1. Since 6 feet correspond to 12 · 6 = 72 inches, the problem is to derive P(X > 72).
Using reduction to the standard normal distribution $X_{ST}$:
$$P(X > 72) = 1 - F_X(72) = 1 - F_{X_{ST}}((72 - 70)/3) = 1 - F_{X_{ST}}(2/3) \approx 0.2514.$$
A better approximation is provided by R function pnorm.
> 1 - pnorm(12 * 6, mean = 70, sd = 3)
[1] 0.2524925
2. The measure of height in centimeters is Z = 2.54X, hence it has a normal
distribution with parameters µZ = 2.54µX = 177.8, σZ = 2.54σX = 7.62. In
the same way, the measure in metres is normally distributed with mean 1.778
and standard deviation 0.0762.
3. For symmetric and unimodal distributions, shortest intervals are built starting
from the mode (coincident with the median) and moving symmetrically on the
left and on the right until the required area is reached. Since the area outside
the interval is 0.10, the endpoints of the interval are the quantiles x0.05 and
x0.95 . Their values (centimeters) are derived through R function qnorm.
> qnorm(c(0.05, 0.95), mean = 177.8, sd = 7.62)
[1] 165.2662 190.3338
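Two points of the solution lend themselves to a numerical check: the probability in part 1 does not depend on the unit of measurement, and the symmetric interval in part 3 is the shortest one with 90% coverage. The sketch below is ours; the grid of lower tail areas a is an arbitrary choice.

```r
mu <- 70; sigma <- 3                        # inches, from the exercise
p.in <- 1 - pnorm(72, mean = mu, sd = sigma)
# Same event expressed in centimeters: Z = 2.54 X
p.cm <- 1 - pnorm(72 * 2.54, mean = 2.54 * mu, sd = 2.54 * sigma)

# Every interval [x_a, x_(a + 0.9)] contains 90% of the population;
# compare the lengths (in cm) for several lower tail areas a
a <- seq(0.01, 0.09, by = 0.01)
len <- qnorm(a + 0.9, mean = 177.8, sd = 7.62) - qnorm(a, mean = 177.8, sd = 7.62)
a.best <- a[which.min(len)]                 # tail area giving the shortest interval
print(c(p.in = p.in, p.cm = p.cm, a.best = a.best, shortest = min(len)))
```

The minimum should occur at a = 0.05, which gives the interval between the quantiles of order 0.05 and 0.95 used above.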
Ex2 If X ∼ N(0, σ), find the density of Y = |X| (Rice, 2.54).
Solution. Note that Y = |X| ≥ 0, hence the pdf of Y is identically zero for all y < 0.
To solve the problem we first derive the cdf $F_Y(y)$. Observe that, for any fixed y ≥ 0,
$$Y \le y \iff -y \le X \le y$$
(see figure below), which implies
[Figure: Y=Abs(X); graph of Abs(x) against x, with the level y and the points −y and y marked.]
$$F_Y(y) = F_X(y) - F_X(-y) + P(X = -y).$$
Since X has a continuous distribution symmetric about 0, P(X = −y) = 0 and
$F_X(-y) = 1 - F_X(y)$. Hence
$$F_Y(y) = 2F_X(y) - 1.$$
The pdf of Y is obtained by differentiation of the previous expression with respect
to y:
$$f_Y(y) = \frac{d}{dy}\,(2F_X(y) - 1) = 2f_X(y), \quad y \ge 0.$$
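As a sanity check, the derived cdf and pdf can be compared with simulated values of |X|. This is a sketch under our own assumptions: σ = 1.5 is an arbitrary illustrative value, and the function names fY and FY are ours.

```r
set.seed(1)
sigma <- 1.5                                            # arbitrary illustrative value
fY <- function(y) 2 * dnorm(y, mean = 0, sd = sigma)    # pdf of |X| for y >= 0
FY <- function(y) 2 * pnorm(y, mean = 0, sd = sigma) - 1  # cdf of |X| for y >= 0

area <- integrate(fY, 0, Inf)$value                     # total area under fY
y.sim <- abs(rnorm(1e5, mean = 0, sd = sigma))          # simulated values of |X|
y0 <- c(0.5, 1, 2, 3)
emp <- sapply(y0, function(y) mean(y.sim <= y))
print(area)
print(rbind(empirical = emp, theoretical = FY(y0)))
```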
[Figure: Absolute Value of N(0, 1); PDF of X and Abs(X) against x, with legend X, Abs(X).]
The figure compares the pdf of X and | X |, in the case σ = 1.
The R functions used to produce the plot are given below.
> plot(function(x) 2 * dnorm(x, mean = 0, sd = 1), 0, 6, lwd = 2,
+     xlim = c(-4, 4), col = "red", xlab = "x", ylab = "PDF of X and Abs(X)",
+     main = "Absolute Value of N(0, 1)")
> plot(function(x) dnorm(x, mean = 0, sd = 1), -6, 6, lwd = 2,
+     add = TRUE)
> lines(c(-6, 0), c(0, 0), lwd = 2, col = "red")
> lines(c(0, 0), c(0, 2 * dnorm(0, mean = 0, sd = 1)), lwd = 2,
+     col = "red", lty = "dashed")
> plot(function(x) dnorm(x, mean = 0, sd = 1), -6, 6, lwd = 2,
+     add = TRUE)
> legend("topleft", col = c("black", "red"), lwd = 2, legend = c("X",
+     "Abs(X)"))
A similar argument can be used to derive the distribution of $Y = X^2$.
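The steps for the squared variable can be sketched as follows. This derivation is our own reconstruction along the lines of the argument for the absolute value, not part of the original handout; it uses the same symbols $F_X$ and $f_X$ and the same symmetry of X about 0.

```latex
For $y > 0$,
\[
  Y = X^2 \le y \iff -\sqrt{y} \le X \le \sqrt{y},
\]
so, by continuity and symmetry of $X$ about $0$,
\[
  F_Y(y) = F_X(\sqrt{y}) - F_X(-\sqrt{y}) = 2F_X(\sqrt{y}) - 1 .
\]
Differentiating with respect to $y$,
\[
  f_Y(y) = 2 f_X(\sqrt{y}) \cdot \frac{1}{2\sqrt{y}}
         = \frac{f_X(\sqrt{y})}{\sqrt{y}}, \qquad y > 0,
\]
and $f_Y(y) = 0$ for $y < 0$.
```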
Ex3 If X ∼ N(µ, σ), prove that P(|X − µ| ≤ 0.675σ) = 0.5 (Rice, 2.56).
Solution. The proof follows because
$$|X - \mu| \le 0.675\sigma \iff -0.675 \le (X - \mu)/\sigma \le 0.675,$$
(X − µ)/σ has the standard normal distribution, and ±0.675 are (to three decimal
places) the first and third quartiles of the standard normal distribution.
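The quartiles and the resulting probability can be confirmed with qnorm and pnorm. In this sketch the values µ = 70 and σ = 3 are arbitrary illustrative parameters of ours; any other choice should give the same probability.

```r
q <- qnorm(c(0.25, 0.75))                   # quartiles of the standard normal
p.std <- pnorm(0.675) - pnorm(-0.675)       # P(|Z| <= 0.675)
mu <- 70; sigma <- 3                        # arbitrary illustrative parameters
p.gen <- pnorm(mu + 0.675 * sigma, mean = mu, sd = sigma) -
  pnorm(mu - 0.675 * sigma, mean = mu, sd = sigma)
print(round(c(q1 = q[1], q3 = q[2], p.std = p.std, p.gen = p.gen), 4))
```

Since 0.675 is a rounded quartile, the probability is 0.5 only up to rounding.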