MATH 755 SPRING 2003
SOME BASIC DATA ANALYSIS

Data analysis can be performed using either the command line or the pull-down menus.

Entering data into Splus

Select Data from the main menu and then click on the Select Data option. Then choose one of the following options:

1. Select Existing Data, and enter the name of the data set (already in Splus) that you want to use.
2. Select New Data, and enter the name of the new data set you want to enter manually. Then click OK and enter your data in the data window.
3. Select Import Data, then OK, and click on Browse in the new window that opens. Then select the data set from your hard drive or a floppy.

Computing the sample mean vector and the sample covariance matrix S

To get the column means of the data contained in a matrix x, type colMeans(x) at the prompt. To get the matrix S of sample variances and covariances, type var(x) at the prompt. Alternatively, using the menus, select Statistics, then Data Summaries, select your data set, and choose either Covariances or Correlations (to get S or the sample correlation matrix R, respectively). To store the resulting matrix in Splus, enter a name in the appropriate window.

Generating normal data

To generate a random sample of size n from a univariate normal distribution with mean a and standard deviation b, type rnorm(n, mean=a, sd=b). For example, the following command generates a random sample of size 100 from a normal distribution with mean 10 and standard deviation 3 and stores the result in the vector y:

> y <- rnorm(100, mean=10, sd=3)

To generate a random sample of size n from a multivariate normal distribution with mean vector a (of dimension p) and covariance matrix b (p by p), and store the result in a data matrix x, type x <- rmvnorm(n, mean=a, cov=b).
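The two preceding sections can be tied together in a short check: simulate a multivariate normal sample, then confirm that colMeans and var roughly recover the mean vector and covariance matrix used to generate it. The sketch below is only an illustration under stated assumptions. It builds the sample from rnorm and chol instead of calling rmvnorm, so that the construction is visible; this is not necessarily how rmvnorm works internally. The values of n, a and b are arbitrary, and chol is assumed to return the upper-triangular factor.

```r
set.seed(755)                        # arbitrary seed, only for reproducibility
n <- 2000                            # assumed sample size
a <- c(1, 2)                         # assumed mean vector (p = 2)
b <- matrix(c(4, 1, 1, 9), 2, 2)     # assumed covariance matrix (symmetric, positive definite)

z <- matrix(rnorm(n * 2), n, 2)      # n by p matrix of independent N(0,1) values
u <- chol(b)                         # upper-triangular factor, t(u) %*% u equals b
x <- sweep(z %*% u, 2, a, "+")       # rows now have mean vector a and covariance matrix b

colMeans(x)                          # sample mean vector, should be close to a
var(x)                               # sample covariance matrix S, should be close to b
```

With a sample this large the two printed results should agree with a and b to roughly one decimal place; smaller n gives visibly noisier estimates.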
Normal probability plot

To construct a normal probability plot (QQ plot) of the sample quantiles of a univariate data set x versus the quantiles of the standard normal distribution, type qqnorm(x).

QQ (probability) plots with other distributions

To produce a probability (QQ) plot of data (a one-dimensional vector of size n) against some standard distribution (such as chi-square), proceed as follows.

First, generate a vector of probabilities (i-1/2)/n, i=1, 2, …, n, suitable for a QQ plot. This is achieved by the Splus command p <- ppoints(n). The input to this function can also be the data set itself (of length n).

Then, generate the corresponding vector of quantiles from the given distribution. For example, for the chi-square distribution use the Splus command qchisq(p, df=k), where p is the vector of probabilities and k is the number of degrees of freedom.

Finally, plot sort(data) (the data sorted from smallest to largest, that is, the order statistics) against the vector of quantiles.

Here is an example:

> x <- rchisq(100, df=4)

This generates a random sample (data) x of size 100 from the chi-square distribution with 4 degrees of freedom. We now make the QQ plot step by step:

> p <- ppoints(x)            (this generates the vector of probabilities p)
> quant <- qchisq(p, df=4)   (this generates the vector of quantiles)
> plot(quant, sort(x))       (this makes the QQ plot)

The above can be done in one step as follows:

> plot(qchisq(ppoints(x), df=4), sort(x))

Please see the online manuals for information about the standard distributions available in Splus (Guide to Statistics, Vol. 1, Chapter 3).

Generalized distance

To convert an n by p data matrix x into an n by 1 vector y of generalized distances (x_j-x_bar)' (S-inverse) (x_j-x_bar) of the n data points (x_j, j=1,…,n, the rows of x) from the mean vector x_bar, use the command y <- gendist(x).
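Written out, the generalized distance is related to the multivariate normal density in a simple way. In the lines below, d_j^2 and f are notation introduced here for convenience: d_j^2 is the generalized distance of row x_j, and f is the p-variate normal density with mean x_bar and covariance matrix S.

```latex
\begin{aligned}
d_j^2 &= (x_j - \bar{x})' S^{-1} (x_j - \bar{x}), \qquad j = 1, \dots, n, \\
f(x_j) &= (2\pi)^{-p/2}\, |S|^{-1/2} \exp\!\left(-\tfrac{1}{2}\, d_j^2\right), \\
d_j^2 &= -2 \log\!\left[ (2\pi)^{p/2}\, |S|^{1/2}\, f(x_j) \right].
\end{aligned}
```

The last line follows by solving the second for d_j^2. It shows that the distances can be recovered from density values (for example, those returned by dmvnorm) without inverting S explicitly, which is the route taken by gendist.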
The gendist function is as follows:

gendist <- function(x)
{
    p <- length(x[1,])               # number of variables (columns of x)
    a <- var(x)                      # sample covariance matrix S
    b <- apply(x, 2, FUN = mean)     # sample mean vector
    # -2*log of the rescaled normal density at each row gives the generalized distance
    -2 * log(((2 * pi)^(p/2)) * sqrt(det(a)) * apply(x, 1, FUN = dmvnorm, mean = b, cov = a))
}

It is available on the machines in the Math Center.

Box-Cox transformation

To find the optimal lambda in the Box-Cox transformation, plot the sample variance of the transformed data as in (4-37) (see the text) versus lambda, and choose a convenient value of lambda near the minimum. To make the plot, use the command box.cox(x), where x is a univariate data set (of positive values). Here is the box.cox function:

box.cox <- function(x)
{
    a <- exp(mean(log(x)))           # geometric mean of the data
    y <- 1:100
    lam <- 1:100
    for (i in 1:100) {
        lam[i] <- -1.001 + 0.02 * i  # lambda grid from -0.981 to 0.999; never exactly 0
        y[i] <- var((x^lam[i] - 1)/(lam[i] * a^(lam[i] - 1)))
    }
    plot(lam, y)
}

To apply the Box-Cox transformation with a given lambda to a data set x, type (x^lambda - 1)/lambda if lambda is not zero, or log(x) if lambda = 0.
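Reading the minimum off the plot can be supplemented by locating it numerically. The following is a minimal sketch, not one of the course functions: the name box.cox.min is hypothetical, and the function simply repeats the grid and variance calculation of box.cox and returns the grid value of lambda with the smallest variance instead of plotting. The example data are artificial (their logarithms are exactly normal scores), chosen only so that the expected answer is easy to anticipate.

```r
# Hypothetical helper (not part of the handout): same grid as box.cox,
# but returns the lambda with the smallest variance instead of plotting.
box.cox.min <- function(x)
{
    a <- exp(mean(log(x)))               # geometric mean of the data
    lam <- -1.001 + 0.02 * (1:100)       # same lambda grid as in box.cox
    y <- numeric(100)
    for (i in 1:100) {
        y[i] <- var((x^lam[i] - 1)/(lam[i] * a^(lam[i] - 1)))
    }
    lam[order(y)[1]]                     # grid point where the variance is smallest
}

x <- exp(qnorm(ppoints(200)))            # artificial positive data; log(x) is symmetric about 0
box.cox.min(x)                           # expected to be near 0, pointing to the log transformation
```

The returned value is limited to the grid spacing of 0.02, so it is best treated as a guide for choosing a convenient nearby lambda (such as 0, 0.5 or 1) and not as a precise estimate.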