2.10. PROBABILITY DISTRIBUTIONS AND RANDOM NUMBERS
The svd() function returns a list with components corresponding to a vector
of singular values, a matrix with columns corresponding to the left singular
vectors, and a matrix with columns containing the right singular vectors.
2.10 Probability distributions and random number generation
R can calculate quantiles and cumulative distribution values as well as generate
random numbers for a large number of distributions. Random variables are
commonly needed for simulation and analysis.
A seed can be specified for the random number generator. This is important
to allow replication of results (e.g., while testing and debugging). Information
about random number seeds can be found in Section 2.10.11.
Table 2.1 summarizes support for quantiles, cumulative distribution functions,
and random numbers. Prepend d to the command to compute the density function
of a distribution, dNAME(xvalue, parm1, ..., parmn), p for the cumulative
distribution function, pNAME(xvalue, parm1, ..., parmn), q for the quantile
function, qNAME(prob, parm1, ..., parmn), and r to generate random variables,
rNAME(nrand, parm1, ..., parmn), where in the last case a vector of
nrand values is the result.
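As a minimal sketch of this naming convention, the four prefixes can be applied to the binomial distribution (NAME is binom in Table 2.1). The parameter values and the seed are arbitrary choices made for illustration.

```r
# The four prefixes applied to the binomial distribution (NAME = binom);
# size=10 and prob=.5 are arbitrary illustrative parameters
dens = dbinom(3, size=10, prob=.5)     # density (probability mass): P(X = 3)
cumprob = pbinom(3, size=10, prob=.5)  # cumulative distribution: P(X <= 3)
quant = qbinom(.5, size=10, prob=.5)   # quantile: smallest x with P(X <= x) >= .5
set.seed(1)                            # arbitrary seed, for replication only
x = rbinom(4, size=10, prob=.5)        # vector of 4 random values
```

The parameters after the first argument (here size and prob) play the role of parm1, ..., parmn and differ from one distribution to the next.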
More information on probability distributions can be found in the CRAN
Probability Distributions Task View.
2.10.1 Probability density function
Example: See 2.13.7
Here we use the normal distribution as an example; others are shown in Table
2.1.
> dnorm(1.96, mean=0, sd=1)
[1] 0.05844094
> dnorm(0, mean=0, sd=1)
[1] 0.3989423
2.10.2 Cumulative distribution function
Here we use the normal distribution as an example; others are shown in Table
2.1.
> pnorm(1.96, mean=0, sd=1)
[1] 0.9750021
CHAPTER 2. DATA MANAGEMENT
Table 2.1: Quantiles, Probabilities, and Pseudorandom Number Generation:
Available Distributions

Distribution          NAME
Beta                  beta
Beta-binomial         betabin*
binomial              binom
Cauchy                cauchy
chi-square            chisq
exponential           exp
F                     f
gamma                 gamma
geometric             geom
hypergeometric        hyper
inverse normal        inv.gaussian*
Laplace               laplace*
logistic              logis
lognormal             lnorm
negative binomial     nbinom
normal                norm
Poisson               pois
Student's t           t
Uniform               unif
Weibull               weibull

Note: See Section 2.10 for details regarding syntax to call these routines.
* The betabin(), inv.gaussian(), and laplace() families of distributions
are available using library(VGAM).
2.10.3 Quantiles of a probability density function
Here we calculate the 97.5th percentile of the normal distribution as an
example; others are shown in Table 2.1.
> qnorm(.975, mean=0, sd=1)
[1] 1.959964
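The quantile function inverts the cumulative distribution function, which the following sketch checks for the normal and t distributions. The choice of 10 degrees of freedom is arbitrary and not taken from the text.

```r
# qNAME() inverts pNAME(): illustrated with the normal and t distributions
zcrit = qnorm(.975)            # 97.5th percentile of the standard normal
tcrit = qt(.975, df=10)        # same percentile for a t with 10 df (arbitrary)
pnorm(zcrit)                   # recovers .975
pt(tcrit, df=10)               # recovers .975
# two-sided tail area beyond +/- zcrit
tailarea = 2 * pnorm(zcrit, lower.tail=FALSE)
```

The t quantile is larger than the normal one because the t distribution has heavier tails.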
2.10.4 Uniform random variables
x = runif(n, min=0, max=1)
The arguments specify the number of variables to be created and the range
over which they are distributed (by default the unit interval).
2.10.5 Multinomial random variables
x = sample(1:r, n, replace=TRUE, prob=c(p1, p2, ..., pr))
Here p1, ..., pr, with $\sum_r p_r = 1$, are the desired probabilities (see also
rmultinom() in the stats package as well as the cut() function).
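A short sketch of this approach follows, with three categories. The probabilities, sample size, and seed are arbitrary illustrative values.

```r
# Draw 1000 values from categories 1..3 with unequal probabilities
# (the probabilities and seed are arbitrary illustrative choices)
set.seed(10)
probs = c(.2, .3, .5)
x = sample(1:3, 1000, replace=TRUE, prob=probs)
table(x) / length(x)             # observed proportions, close to probs
# rmultinom() returns the category counts directly, one column per draw
counts = rmultinom(1, size=1000, prob=probs)
```

Using sample() yields the individual category labels, while rmultinom() yields only the totals per category.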
2.10.6 Normal random variables
Example: See 2.13.7
x1 = rnorm(n)
x2 = rnorm(n, mean=mu, sd=sigma)
The arguments specify the number of variables to be created and (optionally)
the mean and standard deviation (default µ = 0 and σ = 1).
2.10.7 Multivariate normal random variables
For the following, we first create a 3 × 3 covariance matrix. Then we generate 1000 realizations of a multivariate normal vector with the appropriate
correlation or covariance.
library(MASS)
mu = rep(0, 3)
Sigma = matrix(c(3, 1, 2,
                 1, 4, 0,
                 2, 0, 5), nrow=3)
xvals = mvrnorm(1000, mu, Sigma)
apply(xvals, 2, mean)
or
rmultnorm = function(n, mu, vmat, tol=1e-07)
# a function to generate random multivariate Gaussians
{
   p = ncol(vmat)
   if (length(mu)!=p)
      stop("mu vector is the wrong length")
   if (max(abs(vmat - t(vmat))) > tol)
      stop("vmat not symmetric")
   vs = svd(vmat)
   vsqrt = t(vs$v %*% (t(vs$u) * sqrt(vs$d)))
   ans = matrix(rnorm(n * p), nrow=n) %*% vsqrt
   ans = sweep(ans, 2, mu, "+")
   dimnames(ans) = list(NULL, dimnames(vmat)[[2]])
   return(ans)
}
xvals = rmultnorm(1000, mu, Sigma)
apply(xvals, 2, mean)
The returned object xvals, of dimension 1000×3, is generated from the
variance-covariance matrix denoted by Sigma, which has first row and column (3,1,2).
An arbitrary mean vector can be specified using the c() function.
Several techniques are illustrated in the definition of the rmultnorm() function. The first lines test for the appropriate arguments, and return an error
(see 2.11.3) if the conditions are not satisfied. The singular value decomposition (see 2.9.14) is carried out on the variance-covariance matrix to obtain
its matrix square root, which is multiplied by the univariate normal random
variables generated by rnorm() to give them the desired covariance. The
sweep() function then adds the desired mean to each column. The dimnames()
function applies the existing names (if any) for the variables in vmat, and the
result is returned.
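The same idea can be sketched with a different matrix square root. The example below swaps in the Cholesky factor from chol() in place of the singular value decomposition, so it is an alternative to rmultnorm() rather than a copy of it. The mean vector, sample size, and seed are hypothetical values chosen for illustration.

```r
# Sketch: multivariate normals via the Cholesky factor instead of the SVD.
# chol() returns upper-triangular R with t(R) %*% R equal to Sigma, so
# rows of z %*% R have covariance Sigma when z holds independent N(0,1)s.
set.seed(123)                        # arbitrary seed
mu = c(1, 2, 3)                      # hypothetical mean vector
Sigma = matrix(c(3, 1, 2,
                 1, 4, 0,
                 2, 0, 5), nrow=3)
n = 10000
R = chol(Sigma)
z = matrix(rnorm(n * 3), nrow=n)
xvals = sweep(z %*% R, 2, mu, "+")   # add the mean to each column
colMeans(xvals)                      # close to mu
cov(xvals)                           # close to Sigma
```

The Cholesky factor requires a positive definite matrix, whereas the decomposition used in rmultnorm() also copes with singular covariance matrices.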
2.10.8 Truncated multivariate normal random variables
library(tmvtnorm)
x = rtmvnorm(n, mean, sigma, lower, upper)
The arguments specify the number of variables to be created, the mean vector
and covariance matrix, and vectors of lower and upper truncation values.
2.10.9 Exponential random variables
x = rexp(n, rate=lambda)
The arguments specify the number of variables to be created and (optionally)
the inverse of the mean (default λ = 1).
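Because rexp() is parameterized by the rate rather than the mean, a quick check of the sample mean can guard against confusing the two. The rate, sample size, and seed below are arbitrary.

```r
# rexp() is parameterized by the rate, so the mean is 1/rate
set.seed(2)                  # arbitrary seed
lambda = 4                   # illustrative rate; expected mean is 1/4
x = rexp(10000, rate=lambda)
mean(x)                      # close to 1/lambda
```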
2.10.10 Other random variables
Example: See 2.13.7
The list of probability distributions supported can be found in Table 2.1. In
addition to these distributions, the inverse probability integral transform can be
used to generate arbitrary random variables with invertible cumulative distribution
function F (exploiting the fact that $F(X) \sim U(0, 1)$). As an example, consider
the generation of random variates from an exponential distribution with rate
parameter λ, where $F(X) = 1 - \exp(-\lambda X) = U$. Solving for X yields
$X = -\log(1 - U)/\lambda$. If we generate 500 Uniform(0,1) variables, we can use this
relationship to generate 500 exponential random variables with the desired rate
parameter (see also 7.3.4, sampling from pathological distributions).
lambda = 2
expvar = -log(1-runif(500))/lambda
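As a check on this derivation, the closed-form transform can be compared with the built-in quantile function qexp(), which applies the same inversion. The seed is arbitrary, and the Cauchy line is an added illustration that any qNAME() function from Table 2.1 can be used the same way.

```r
# Inverse probability integral transform, checked against qexp()
set.seed(3)                         # arbitrary seed
lambda = 2
u = runif(500)                      # Uniform(0,1) variables
expvar = -log(1 - u) / lambda       # closed-form inverse of F
all.equal(expvar, qexp(u, rate=lambda))  # same transform via qNAME()
mean(expvar)                        # close to 1/lambda
# the same idea with any quantile function, e.g., the Cauchy from Table 2.1
cauchyvar = qcauchy(u)
```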
2.11. CONTROL FLOW AND PROGRAMMING
2.10.11 Setting the random number seed
The default behavior is a (pseudorandom) seed based on the system clock.
To generate a replicable series of variates, first run set.seed(seedval) where
seedval is a single integer for the default “Mersenne-Twister” random number
generator. For example:
set.seed(42)           # fixed seed: replicable results
set.seed(Sys.time())   # seed taken from the clock: not replicable
More information can be found using help(.Random.seed).
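A minimal sketch of why the seed matters: resetting it to the same value reproduces the same variates. The seed value 42 is the one used above, and the variable names are illustrative.

```r
# Resetting the seed reproduces the same sequence of variates
set.seed(42)
first = runif(3)
set.seed(42)
second = runif(3)
identical(first, second)    # TRUE
third = runif(3)            # continues the stream, so differs from first
```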
2.11 Control flow, programming, and data generation
Here we show some basic aspects of control flow, programming, and data generation (see also 7.2, data generation and 1.6.2, writing functions).
2.11.1 Looping
Example: See 7.1.2
x = numeric(i2-i1+1)   # create placeholder
for (i in 1:length(x)) {
   x[i] = rnorm(1)     # this is slow and inefficient!
}
or (preferably)
x = rnorm(i2-i1+1)     # this is far better
Many tasks that could be written as a loop are dramatically faster if
they are encoded as a vector operation (as in the second and preferred option
above). Examples of situations where loops are particularly useful can be found
in Sections 4.1.6 and 7.1.2. More information on control structures for looping
and conditional processing can be found in help(Control).
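The difference in speed can be measured with system.time(), as in the following sketch. The vector length n is an arbitrary choice, and the timings will vary by machine.

```r
# Compare a loop with the vectorized call; n is an arbitrary size
n = 100000
looptime = system.time({
   x = numeric(n)                 # create placeholder
   for (i in seq_along(x)) {
      x[i] = rnorm(1)
   }
})
vectime = system.time({
   y = rnorm(n)
})
looptime["elapsed"]               # seconds taken by the loop
vectime["elapsed"]                # seconds taken by the vectorized call
```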
2.11.2 Error recovery
Example: See 2.13.2
try(expression, silent=FALSE)
The try() function runs the given expression and traps any errors that may
arise (displaying them on the standard error output device unless silent=TRUE).
The related function geterrmessage() returns the most recent error message.
2.13. HELP EXAMPLES
> with(ds, mean(cesd[female==1]))
[1] 36.9
> tapply(ds$cesd, ds$female, mean)
   0    1
31.6 36.9
> aggregate(ds$cesd, list(ds$female), mean)
  Group.1    x
1       0 31.6
2       1 36.9
2.13.7 Probability distributions
Data can easily be generated. As an example, we can find values of the normal
(2.10.6) and t densities, and display them in Figure 2.1.
> x = seq(from=-4, to=4.2, length=100)
> normval = dnorm(x, 0, 1)
> dfval = 1
> tval = dt(x, df=dfval)
> plot(x, normval, type="n", ylab="f(x)", las=1)
> lines(x, normval, lty=1, lwd=2)
> lines(x, tval, lty=2, lwd=2)
> legend(1.1, .395, lty=1:2, lwd=2,
+    legend=c(expression(N(mu == 0,sigma == 1)),
+    paste("t with ", dfval," df", sep="")))
> grid(nx=NULL, ny=NULL, col="darkgray")
Mathematical symbols (6.2.13) representing the parameters of the normal distribution are included as part of the legend (6.2.15) to help differentiate the
distributions. A grid (6.2.7) is also added.
Figure 2.1: Comparison of standard normal and t distribution with 1 df.
(Plot of f(x), from 0.0 to 0.4, against x, from −4 to 4; legend: N(μ = 0, σ = 1) and t with 1 df.)