MATH 755 SPRING 2003
SOME BASIC DATA ANALYSIS
Data analysis in Splus can be performed using either the command line or the pull-down menus.
Entering data into Splus
Select Data from the main menu and then click on the Select Data option. Then choose one
of the following options:
1. Select Existing Data, and enter the name of the data set (that is already in Splus)
you want to use.
2. Select New Data, and enter the name of the new data set you want to enter
manually in the data window. Then click OK and enter your data set in the data
window.
3. Select Import Data, then OK, and click on Browse in the new window that opens
up. Then proceed by selecting the data set from your hard drive or a floppy.
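The import step can also be done from the command line. The following is a minimal sketch in R-style S syntax, assuming the data sit in a plain text file with a header row; the temporary file, its contents, and the variable names height and weight are made up for illustration.

```r
# Create a small stand-in text file (hypothetical data, for illustration only)
tmp <- tempfile()
writeLines(c("height weight",
             "1.70 65",
             "1.82 80",
             "1.65 58"), tmp)

# Read the file into a data frame, using the first line as column names
dat <- read.table(tmp, header = TRUE)

# Convert to a numeric matrix x for use with colMeans(x) and var(x)
x <- as.matrix(dat)
```

In practice the path of your own data file would replace tmp.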
Computing the sample mean vector and the sample covariance matrix S
To get the column means of the data contained in a matrix x, type colMeans(x) at the
prompt. To get the sample covariance matrix S, type var(x) at the prompt. Alternatively,
using the menus, select Statistics, then Data Summaries, select your data set, and choose
either Covariances or Correlations (to get either S or the sample correlation matrix R).
To store the resulting matrix in Splus, enter a name for it in the appropriate window.
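As a small illustration of these commands, the sketch below computes the mean vector, S, and R for a made-up 4 by 2 data matrix, and checks that R is S rescaled by the standard deviations. It is written in R-style S syntax; the data values and the names xbar, D, and R_check are assumptions introduced for the example.

```r
# Made-up data: 4 observations on 2 variables (matrix() fills column by column)
x <- matrix(c(1, 2, 3, 4,
              2, 4, 6, 9), ncol = 2)

xbar <- colMeans(x)   # sample mean vector
S <- var(x)           # sample covariance matrix (divisor n - 1)
R <- cor(x)           # sample correlation matrix

# R can be recovered from S: divide each entry by the two standard deviations
D <- diag(1 / sqrt(diag(S)))
R_check <- D %*% S %*% D
```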
Generating normal data
To generate a random sample of size n from a univariate normal distribution with mean a
and standard deviation b, type rnorm(n,mean=a, sd=b). For example, the following
command generates a random sample of size 100 from a normal distribution with
mean 10 and standard deviation 3 and stores the result in vector y:
>y<-rnorm(100,mean=10,sd=3)
To generate a random sample of size n from a multivariate normal distribution with mean
vector a (of dimension p) and covariance matrix b (p by p), and store the result in a data
matrix x, type x<-rmvnorm(n, mean=a, cov=b).
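Where rmvnorm is not available (it is not part of base R, for instance), multivariate normal data can be generated from univariate normal draws. The sketch below uses the Cholesky factor of the covariance matrix; the function name rmvnorm_sketch and the example values of mu and Sigma are hypothetical, and this is one possible construction rather than the method Splus itself uses.

```r
# Hypothetical helper: draw n rows from a p-variate normal N(mu, Sigma)
rmvnorm_sketch <- function(n, mu, Sigma) {
  p <- length(mu)
  U <- chol(Sigma)                      # upper triangular, t(U) %*% U equals Sigma
  Z <- matrix(rnorm(n * p), nrow = n)   # n by p matrix of independent N(0, 1) draws
  # Each row of Z %*% U has covariance t(U) %*% U = Sigma; then shift column j by mu[j]
  sweep(Z %*% U, 2, mu, FUN = "+")
}

set.seed(755)                           # arbitrary seed, for reproducibility only
mu <- c(1, 2)                           # made-up mean vector
Sigma <- matrix(c(4, 1,
                  1, 2), ncol = 2)      # made-up covariance matrix
x <- rmvnorm_sketch(5000, mu, Sigma)
```

With a sample this large, colMeans(x) and var(x) should be close to mu and Sigma.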
Normal probability plot
To construct a normal probability plot (QQ plot) of the sample quantiles of a
univariate data set x versus the quantiles of the standard normal distribution, type qqnorm(x).
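A short sketch of a normal probability plot on simulated data follows. The seed and the sample are made up for illustration, and qqline, which adds a reference line through the quartiles, is an optional extra not mentioned above.

```r
set.seed(755)                        # arbitrary seed, for reproducibility only
y <- rnorm(100, mean = 10, sd = 3)   # simulated normal sample of size 100

qq <- qqnorm(y)   # draws the plot and returns the plotted coordinates (x, y)
qqline(y)         # adds a reference line through the first and third quartiles
```

For normal data the points should fall close to the reference line.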
QQ (probability) plots with other distributions
To produce a probability (QQ) plot of data (a one-dimensional vector of size n) against
some standard distribution (such as chi-square), proceed as follows. First, generate a
vector of probabilities (i-1/2)/n, i=1, 2, …,n, suitable for producing a QQ plot. This is
achieved by the Splus command p<-ppoints(n). The input to this function can also be the
data set of length n. Then, generate the corresponding vector of quantiles from the given
distribution. For example, for the chi-square distribution use the Splus command
qchisq(p, df=k), where p is the vector of probabilities and k is the number of degrees of
freedom. Finally, plot sort(data) (the data sorted from smallest to largest, that is, the
order statistics) against the vector of quantiles. Here is an example:
>x<-rchisq(100,df=4)
This generates a random sample (data) x of size 100 from the chi-square distribution with
4 degrees of freedom. We now proceed with the QQ plot in two steps:
>p<-ppoints(x) (this generates the vector of probabilities p)
> quant<-qchisq(p, df=4) (this generates the vector of quantiles)
> plot(quant,sort(x)) (this makes the QQ plot)
The above can be done in one step as follows:
>plot(qchisq(ppoints(x),df=4),sort(x))
Please see online manuals for information about standard distributions available in Splus
(Guide to Statistics Vol. 1 Chapter 3).
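The same three steps work for any distribution that has a quantile function. The sketch below wraps them in a small helper and applies it to exponential data; the name qq_against, the seed, and the rate value are assumptions made for the example.

```r
# Hypothetical helper: QQ plot of data x against the distribution with quantile function qfun
qq_against <- function(x, qfun, ...) {
  p <- ppoints(length(x))     # plotting positions; (i - 1/2)/n when n > 10 in R
  quant <- qfun(p, ...)       # theoretical quantiles at those probabilities
  plot(quant, sort(x), xlab = "Theoretical quantiles", ylab = "Ordered data")
  invisible(list(x = quant, y = sort(x)))   # return the plotted coordinates
}

set.seed(755)                 # arbitrary seed, for reproducibility only
x <- rexp(100, rate = 2)      # simulated exponential sample
out <- qq_against(x, qexp, rate = 2)
```

Extra arguments after qfun (here rate = 2) are passed on to the quantile function, so the chi-square case above would be qq_against(x, qchisq, df = 4).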
Generalized distance
To convert an n by p data matrix x into an n by 1 vector y of generalized distances
(x_j-x_bar)' (S-inverse) (x_j-x_bar) of the n data points (x_j, j=1,…,n, the rows of x)
from the mean vector x_bar, use the command y<-gendist(x). This function is as follows:
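gendist<-function(x)
{
# p = number of variables (columns of x)
p <- length(x[1,])
# a = sample covariance matrix S, b = sample mean vector x_bar
a <- var(x)
b <- apply(x, 2, FUN = mean)
# For row x_j, dmvnorm returns exp(-d_j/2)/((2*pi)^(p/2)*sqrt(det(S))),
# so -2*log(constant * density) recovers the generalized distance d_j
-2 * log(((2 * pi)^(p/2)) * sqrt(det(a)) * apply(x, 1,
FUN = dmvnorm, mean = b, cov = a))
}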
and is available on the machines in the Math Center.
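Away from the Math Center machines, the same distances can be obtained without dmvnorm. The sketch below uses the mahalanobis function (part of R's stats package, which returns squared distances) and checks it against the definition row by row; the names gendist_sketch and d_check and the simulated data are assumptions made for the example.

```r
# Hypothetical stand-in for gendist(): generalized (squared) distances from the mean
gendist_sketch <- function(x) {
  x <- as.matrix(x)
  mahalanobis(x, center = colMeans(x), cov = var(x))
}

set.seed(755)                       # arbitrary seed, for reproducibility only
x <- matrix(rnorm(60), ncol = 3)    # made-up data: n = 20 rows, p = 3 columns
d <- gendist_sketch(x)

# Same quantity straight from the definition (x_j - x_bar)' S^(-1) (x_j - x_bar)
Sinv <- solve(var(x))
xbar <- colMeans(x)
d_check <- apply(x, 1, function(xj) drop(t(xj - xbar) %*% Sinv %*% (xj - xbar)))
```

Because S uses the divisor n - 1, the distances always sum to (n - 1)p, which gives a quick sanity check. The vector d can then be plotted against chi-square quantiles with p degrees of freedom, as in the QQ plot section.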
Box-Cox transformation
To find the optimal lambda in the Box-Cox transformation, plot the sample variance of
the transformed data as in (4-37) (see the text) versus lambda and choose a convenient
value of lambda near the minimum. To make the plot, use the command box.cox(x),
where x is a univariate data set (of positive values). Here is the box.cox function:
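box.cox<-function(x){
# a = geometric mean of the data
a <- exp(mean(log(x)))
y <- numeric(100)
lam <- numeric(100)
for (i in 1:100){
# grid of lambda values from -0.981 to 0.999; the offset keeps lambda away from 0
lam[i] <- -1.001+0.02*i
# sample variance of the normalized transformed data
y[i] <- var((x^lam[i]-1)/(lam[i]*a^(lam[i]-1)))
}
plot(lam,y)
}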
To apply the Box-Cox transformation with a given lambda to a data set x, type
(x^lambda-1)/lambda if lambda is not zero, or log(x) if lambda=0.
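The two cases can be wrapped in a single function, and the grid search behind the plot can return the minimizing lambda directly. The following is a sketch only: the names boxcox_apply and boxcox_grid are hypothetical, the grid is the same one box.cox uses, and the simulated log-normal data are made up so that the minimum should fall near lambda = 0.

```r
# Hypothetical helper: apply the Box-Cox transformation for a given lambda
boxcox_apply <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Hypothetical helper: lambda on the grid that minimizes the variance of the
# normalized transformed data (the quantity plotted by box.cox)
boxcox_grid <- function(x, lam = -1.001 + 0.02 * (1:100)) {
  a <- exp(mean(log(x)))   # geometric mean of the data
  v <- sapply(lam, function(l) var((x^l - 1) / (l * a^(l - 1))))
  lam[which.min(v)]
}

set.seed(755)                            # arbitrary seed, for reproducibility only
x <- exp(rnorm(500, mean = 1, sd = 1))   # log-normal data, so log(x) is normal
lambda_hat <- boxcox_grid(x)
z <- boxcox_apply(x, lambda_hat)         # transformed data
```

In practice a convenient nearby value (such as 0, 0.5, or 1) is often used in place of the exact minimizer, as suggested above.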