The svd() function returns a list with components corresponding to a vector of singular values, a matrix with columns corresponding to the left singular vectors, and a matrix with columns containing the right singular vectors.

2.10 Probability distributions and random number generation

R can calculate quantiles and cumulative distribution values as well as generate random numbers for a large number of distributions. Random variables are commonly needed for simulation and analysis.

A seed can be specified for the random number generator. This is important to allow replication of results (e.g., while testing and debugging). Information about random number seeds can be found in Section 2.10.11.

Table 2.1 summarizes support for quantiles, cumulative distribution functions, and random numbers. Prepend d to the distribution name to compute the density function, dNAME(xvalue, parm1, ..., parmn); p for the cumulative distribution function, pNAME(xvalue, parm1, ..., parmn); q for the quantile function, qNAME(prob, parm1, ..., parmn); and r to generate random variables, rNAME(nrand, parm1, ..., parmn), where in the last case a vector of nrand values is the result. More information on probability distributions can be found in the CRAN Probability Distributions Task View.

2.10.1 Probability density function

Example: See 2.13.7

Here we use the normal distribution as an example; others are shown in Table 2.1.

    > dnorm(1.96, mean=0, sd=1)
    [1] 0.05844094
    > dnorm(0, mean=0, sd=1)
    [1] 0.3989423

2.10.2 Cumulative distribution function

Here we use the normal distribution as an example; others are shown in Table 2.1.

    > pnorm(1.96, mean=0, sd=1)
    [1] 0.9750021

CHAPTER 2. DATA MANAGEMENT

Table 2.1: Quantiles, Probabilities, and Pseudorandom Number Generation: Available Distributions

    Distribution         NAME
    Beta                 beta
    Beta-binomial        betabin∗
    binomial             binom
    Cauchy               cauchy
    chi-square           chisq
    exponential          exp
    F                    f
    gamma                gamma
    geometric            geom
    hypergeometric       hyper
    inverse normal       inv.gaussian∗
    Laplace              laplace∗
    logistic             logis
    lognormal            lnorm
    negative binomial    nbinom
    normal               norm
    Poisson              pois
    Student's t          t
    Uniform              unif
    Weibull              weibull

Note: See Section 2.10 for details regarding syntax to call these routines.
∗ The betabin(), inv.gaussian(), and laplace() families of distributions are available using library(VGAM).

2.10.3 Quantiles of a probability density function

Here we calculate the 97.5th percentile of the normal distribution as an example; others are shown in Table 2.1.

    > qnorm(.975, mean=0, sd=1)
    [1] 1.959964

2.10.4 Uniform random variables

    x = runif(n, min=0, max=1)

The arguments specify the number of variables to be created and the range over which they are distributed (by default the unit interval).

2.10.5 Multinomial random variables

    x = sample(1:r, n, replace=TRUE, prob=c(p1, p2, ..., pr))

Here p1, ..., pr are the desired probabilities, with $\sum_{j=1}^{r} p_j = 1$ (see also rmultinom() in the stats package as well as the cut() function).

2.10.6 Normal random variables

Example: See 2.13.7

    x1 = rnorm(n)
    x2 = rnorm(n, mean=mu, sd=sigma)

The arguments specify the number of variables to be created and (optionally) the mean and standard deviation (default µ = 0 and σ = 1).

2.10.7 Multivariate normal random variables

For the following, we first create a 3 × 3 covariance matrix. Then we generate 1000 realizations of a multivariate normal vector with the appropriate correlation or covariance.
    library(MASS)
    mu = rep(0, 3)
    Sigma = matrix(c(3, 1, 2,
                     1, 4, 0,
                     2, 0, 5), nrow=3)
    xvals = mvrnorm(1000, mu, Sigma)
    apply(xvals, 2, mean)

or

    rmultnorm = function(n, mu, vmat, tol=1e-07)
    # a function to generate random multivariate Gaussians
    {
       p = ncol(vmat)
       # check the arguments before doing any work
       if (length(mu)!=p)
          stop("mu vector is the wrong length")
       if (max(abs(vmat - t(vmat))) > tol)
          stop("vmat not symmetric")
       # matrix square root of vmat from its singular value decomposition
       vs = svd(vmat)
       vsqrt = t(vs$v %*% (t(vs$u) * sqrt(vs$d)))
       # impose the covariance on independent standard normals, then add the mean
       ans = matrix(rnorm(n * p), nrow=n) %*% vsqrt
       ans = sweep(ans, 2, mu, "+")
       dimnames(ans) = list(NULL, dimnames(vmat)[[2]])
       return(ans)
    }
    xvals = rmultnorm(1000, mu, Sigma)
    apply(xvals, 2, mean)

The returned object xvals, of dimension 1000 × 3, is generated from the variance covariance matrix denoted by Sigma, which has first row and column (3,1,2). An arbitrary mean vector can be specified using the c() function.

Several techniques are illustrated in the definition of the rmultnorm() function. The first lines test for the appropriate arguments, and return an error (see 2.11.3) if the conditions are not satisfied. The singular value decomposition (see 2.9.14) is carried out on the variance covariance matrix to obtain a matrix square root. Multiplying the univariate normal random variables generated by rnorm() by this square root gives them the desired covariance, and the sweep() function then adds the desired mean to each column. The dimnames() function applies the existing names (if any) for the variables in vmat, and the result is returned.

2.10.8 Truncated multivariate normal random variables

    library(tmvtnorm)
    x = rtmvnorm(n, mean, sigma, lower, upper)

The arguments specify the number of variables to be created, the mean vector and covariance matrix, and vectors of lower and upper truncation values.

2.10.9 Exponential random variables

    x = rexp(n, rate=lambda)

The arguments specify the number of variables to be created and (optionally) the inverse of the mean (default λ = 1).
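The rate parametrization of rexp() can be checked empirically. The following is a minimal sketch rather than part of the original text: the seed, the rate, the sample size, and the name med are illustrative assumptions.

```r
set.seed(42)                  # arbitrary seed, used only for replicability
lambda = 2                    # rate parameter (hypothetical value)
n = 100000                    # number of variates (hypothetical value)
x = rexp(n, rate=lambda)

mean(x)                       # sample mean, should be close to 1/lambda
1/lambda                      # theoretical mean

# the q and p functions for the same distribution invert each other
med = qexp(0.5, rate=lambda)  # theoretical median, log(2)/lambda
pexp(med, rate=lambda)        # recovers the probability 0.5
```

The same pattern (generate with rNAME(), then compare against qNAME() and pNAME()) should apply to the other distributions in Table 2.1.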
2.10.10 Other random variables

Example: See 2.13.7

The list of probability distributions supported can be found in Table 2.1. In addition to these distributions, the inverse probability integral transform can be used to generate arbitrary random variables with invertible cumulative distribution function F (exploiting the fact that $F(X) \sim U(0, 1)$, so that $F^{-1}(U)$ has cumulative distribution function $F$). As an example, consider the generation of random variates from an exponential distribution with rate parameter λ, where $F(X) = 1 - \exp(-\lambda X) = U$. Solving for $X$ yields $X = -\log(1 - U)/\lambda$. If we generate 500 Uniform(0,1) variables, we can use this relationship to generate 500 exponential random variables with the desired rate parameter (see also 7.3.4, sampling from pathological distributions).

    lambda = 2
    expvar = -log(1-runif(500))/lambda

2.10.11 Setting the random number seed

The default behavior is a (pseudorandom) seed based on the system clock and the process ID. To generate a replicable series of variates, first run set.seed(seedval) where seedval is a single integer for the default "Mersenne-Twister" random number generator. For example:

    set.seed(42)
    set.seed(Sys.time())

More information can be found using help(.Random.seed).

2.11 Control flow, programming, and data generation

Here we show some basic aspects of control flow, programming, and data generation (see also 7.2, data generation and 1.6.2, writing functions).

2.11.1 Looping

Example: See 7.1.2

    x = numeric(i2-i1+1)   # create placeholder
    for (i in seq_along(x)) {
       x[i] = rnorm(1)     # this is slow and inefficient!
    }

or (preferably)

    x = rnorm(i2-i1+1)     # this is far better

Many tasks that could be written as a loop run dramatically faster when they are written as a vector operation (as in the second and preferred option above). Examples of situations where loops are particularly useful can be found in Sections 4.1.6 and 7.1.2. More information on control structures for looping and conditional processing can be found in help(Control).
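The loop and the vectorized call draw from the same random number stream, so with a common seed they would be expected to return the same values. The sketch below checks this and gives a rough timing comparison. The values of i1 and i2, the seed, and the names x.loop and x.vec are assumptions made for illustration, and the behavior shown relies on the default random number generators.

```r
i1 = 1
i2 = 10000                    # hypothetical index range

set.seed(42)
x.loop = numeric(i2-i1+1)     # create placeholder
for (i in seq_along(x.loop)) {
   x.loop[i] = rnorm(1)       # one draw per iteration
}

set.seed(42)                  # reset the seed to replay the same stream
x.vec = rnorm(i2-i1+1)        # all draws in a single call

identical(x.loop, x.vec)      # compares the two vectors element by element

# rough timing comparison; the elapsed times depend on the machine
system.time(for (i in 1:100000) rnorm(1))
system.time(rnorm(100000))
```

Because the two approaches agree given the same seed, the vectorized form can usually replace the loop without changing simulation results.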
2.11.2 Error recovery

Example: See 2.13.2

    try(expression, silent=FALSE)

The try() function runs the given expression and traps any errors that may arise (displaying them on the standard error output device unless silent=TRUE). The related function geterrmessage() can be used to retrieve the most recent error message.

2.13. HELP EXAMPLES

    > with(ds, mean(cesd[female==1]))
    [1] 36.9
    > tapply(ds$cesd, ds$female, mean)
       0    1
    31.6 36.9
    > aggregate(ds$cesd, list(ds$female), mean)
      Group.1    x
    1       0 31.6
    2       1 36.9

2.13.7 Probability distributions

Data can easily be generated. As an example, we can find values of the normal (2.10.6) and t densities, and display them in Figure 2.1.

    > x = seq(from=-4, to=4.2, length=100)
    > normval = dnorm(x, 0, 1)
    > dfval = 1
    > tval = dt(x, df=dfval)
    > plot(x, normval, type="n", ylab="f(x)", las=1)
    > lines(x, normval, lty=1, lwd=2)
    > lines(x, tval, lty=2, lwd=2)
    > legend(1.1, .395, lty=1:2, lwd=2,
    +    legend=c(expression(N(mu == 0,sigma == 1)),
    +    paste("t with ", dfval," df", sep="")))
    > grid(nx=NULL, ny=NULL, col="darkgray")

Mathematical symbols (6.2.13) representing the parameters of the normal distribution are included as part of the legend (6.2.15) to help differentiate the distributions. A grid (6.2.7) is also added.

[Figure 2.1: plot of f(x) against x for x from -4 to 4, with legend entries N(μ = 0, σ = 1) and t with 1 df.]

Figure 2.1: Comparison of standard normal and t distribution with 1 df.
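The difference between the two curves in Figure 2.1 can also be examined numerically. The sketch below uses only base R functions and compares the densities at zero and the probability of falling more than 4 units from zero. The cutoff of 4 and the names normtail and ttail are illustrative choices, not part of the original example.

```r
dfval = 1
dnorm(0, mean=0, sd=1)        # height of the normal density at 0
dt(0, df=dfval)               # height of the t density at 0 (1/pi when df=1)

# probability of a value more than 4 units away from 0, by symmetry
normtail = 2 * pnorm(-4, mean=0, sd=1)
ttail = 2 * pt(-4, df=dfval)
c(normal=normtail, t=ttail)   # the t distribution has far heavier tails
```

The lower peak and heavier tails of the t distribution with 1 df correspond to the dashed curve lying below the normal curve near zero and above it toward the edges of the plot.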