Stat 401 Lab Activity 4
Wednesday, November 2, 2005

Part I: Demonstration of the correlation coefficient

In this activity we will generate and plot bivariate data having different correlation coefficients. We will use a regression model to generate the (X,Y) values. In particular, we will assume that the regression function of Y on X is linear:

E(Y|X=x) = 2 + x.

For each value x of X we will generate a value of Y by adding error (noise) to the regression function. Thus, for a value x of X, a value of Y is generated as

Y = 2 + x + E,

where E is the noise. We will always generate X values from the normal distribution with mean 9 and standard deviation 2. The noise, E, will also be generated from the normal distribution, with zero mean. Different values of the variance of E will produce data sets with different correlation coefficients. Since the regression function will always be the same, this activity also demonstrates that the regression function, though it describes one aspect of the relation between X and Y, is not designed to quantify the degree of dependence between X and Y.

We will generate, plot, and compute the correlation coefficient of three data sets generated with E having standard deviation 1, 2, and 3, respectively. The Minitab command sequences (described below only for the case of E having standard deviation 1) are:

1. Generate 50 X values, store in C1
   Calc> Random Data> Normal> Generate “50” rows of data; Store in column: “C1”, with mean = 9.0 and Standard deviation = 2.0. OK
2. Generate 50 E values, store in C2
   Calc> Random Data> Normal> Generate “50” rows of data; Store in column: “C2”, with mean = 0.0 and Standard deviation = 1.0. OK
3. Calculate the 50 Y values, store in C3
   Calc> Calculator> Store result in variable: C3, Expression: 2 + C1 + C2. OK
4. Do a scatter plot of the 50 (X,Y) values, and compute the correlation coefficient of the (X,Y) values, using command sequences already described.
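For readers without Minitab, the generation steps above can be sketched in Python. This is only an illustrative sketch, not part of the lab: the function names (pearson_r, simulate_r, population_rho) and the seed value are assumptions made for this example. It also prints the population correlation implied by the model when X and E are independent: Cov(X,Y) = Var(X) = 4 and Var(Y) = 4 + Var(E), so the correlation is 2 / sqrt(4 + Var(E)).

```python
import math
import random


def pearson_r(xs, ys):
    """Sample (Pearson) correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(noise_sd, n=50, seed=401):
    """Generate n (X, Y) pairs with Y = 2 + X + E and return their correlation.

    The seed is an arbitrary choice so that runs are repeatable.
    """
    rng = random.Random(seed)
    xs = [rng.gauss(9.0, 2.0) for _ in range(n)]        # step 1: X ~ N(9, sd 2)
    es = [rng.gauss(0.0, noise_sd) for _ in range(n)]   # step 2: E ~ N(0, noise_sd)
    ys = [2.0 + x + e for x, e in zip(xs, es)]          # step 3: Y = 2 + X + E
    return pearson_r(xs, ys)                            # step 4: correlation only


def population_rho(noise_sd, x_sd=2.0):
    """Correlation implied by the model when X and E are independent."""
    return x_sd / math.sqrt(x_sd ** 2 + noise_sd ** 2)


for sd in (1.0, 2.0, 3.0):
    print(f"noise sd {sd}: sample r = {simulate_r(sd):.3f}, "
          f"population rho = {population_rho(sd):.3f}")
```

With only 50 points the sample correlation will vary from run to run around the population value, which is the same behavior the Minitab scatter plots should show.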
For homework 7, generate three sets of (X,Y) values working as above, except generate the X values from a normal distribution with zero mean and standard deviation 2, and use the regression function

E(Y|X=x) = 2 + x^2.

Use the same three standard deviations for the noise E as before. Compare the resulting correlation coefficients with the corresponding (i.e., same standard deviation of the noise) correlation coefficients obtained by using the linear regression function.

Part II: The Sampling Distribution of the Sample Mean and Sample Variance

In this activity, we generate 100 samples of size 10 from the normal distribution with mean 0 and variance 1, and find the sample mean and sample variance of each sample. This amounts to generating random numbers from the sampling distributions of the sample mean and the sample variance. Histograms and probability plots can then be used to check the known facts about the distributions of the sample mean and sample variance.

We begin by generating random numbers from the sampling distribution of $\bar{X}$.

1. Generate 100 samples of size 10:
   Calc> Random Data> Normal> Generate “10” rows of data; Store in columns: “C1-C100”, with mean = 0.0 and Standard deviation = 1.0. OK
2. Find the sample means for the 100 samples and store them in a column. This is done in two steps.
   The first is: Data> Stack> Columns> Stack the following columns: “C1-C100”; under Store stacked data in: select Column of current worksheet and enter “Data”; Store subscripts in “Sample”. Thus all 100 samples, each of size 10 (so a total of 1000 observations), are “stacked” in column C101, and C102-T contains information as to which sample the observation in the corresponding row of C101 came from.
   The second step actually finds the 100 sample means and stores them: Stat> Basic Statistics> Store Descriptive Statistics> select “Data” and “Sample” for Variables and By variables (optional); click Statistics, select “Mean”. OK, OK.
Columns C103-T, C104, and C105 then show the sample number, the sample mean $\bar{X}$ of that sample, and the sample size.

Check the normality of the sample means.
a. Histogram
b. Probability Plot
   Graph> Probability Plots> Select “Single”> Select “C104 Mean1” for Graph variable; click Distribution> select “Normal” under distribution; enter 0 and 0.31623 for Mean and StDev (the standard deviation of $\bar{X}$ is 1/sqrt(10), approximately 0.31623). OK, OK.

5. The Sampling Distribution of the Sample Variance

We next generate random numbers from the sampling distribution of the sample variance. We do this by calculating the sample variance of each of the 100 samples of size 10 and storing them in a column. A histogram and probability plot can be used to check the facts about the sampling distribution of the sample variance. The basic fact is that for a sample of size n from a normal distribution with variance $\sigma^2$,

$(n-1)S^2/\sigma^2 \sim \chi^2_{n-1}$.

For samples of size n = 10, which is what we have,

$(n-1)S^2/\sigma^2 = (10-1)S^2/\sigma^2 \sim \chi^2_9$.

Moreover, in our case the population variance equals 1, so $9S^2 \sim \chi^2_9$. Let’s find 100 sample variances and compare with the chi-square distribution. Repeat step 2 of Part II with two changes:

Stat> Basic Statistics> Store Descriptive Statistics> select “Data” and “Sample” for Variables and By variables (optional); click Statistics, select “Variance”. OK, OK.

Probability Plot. The chi-square distribution with 9 degrees of freedom is the Gamma distribution with shape 4.5 and scale 2, so $S^2 = \chi^2_9/9$ follows the Gamma distribution with shape 4.5 and scale 2/9 (approximately 0.2222).
   Graph> Probability Plots> Select “Single”> Select “Variance2” for Graph variable; click Distribution> select “Gamma” under distribution; enter 4.5 and 0.2222 for Shape and Scale. OK, OK.
To use scale 2 instead, first multiply the variance column by 9 with Calc> Calculator and plot the result.
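The Part II procedure can also be sketched in Python as a cross-check on the Minitab output. This is an illustrative sketch only: the function name sampling_distributions and the seed are assumptions for this example. It draws 100 samples of size 10 from the normal distribution with mean 0 and standard deviation 1, then compares summary values with theory: the sample means should have mean 0 and standard deviation 1/sqrt(10), and 9 times the sample variance should behave like a chi-square variable with 9 degrees of freedom (mean 9, variance 18).

```python
import math
import random
import statistics


def sampling_distributions(num_samples=100, n=10, mu=0.0, sigma=1.0, seed=401):
    """Return (means, variances) for num_samples normal samples of size n.

    The seed is an arbitrary choice so that runs are repeatable.
    """
    rng = random.Random(seed)
    means, variances = [], []
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(statistics.mean(sample))
        variances.append(statistics.variance(sample))  # divides by n - 1
    return means, variances


means, variances = sampling_distributions()
scaled = [9 * s2 for s2 in variances]  # (n - 1) * S^2 / sigma^2 with sigma = 1

print(f"mean of sample means: {statistics.mean(means):.4f} (theory 0)")
print(f"sd of sample means:   {statistics.stdev(means):.4f} "
      f"(theory {1 / math.sqrt(10):.4f})")
print(f"mean of 9*S^2:        {statistics.mean(scaled):.3f} (theory 9)")
print(f"variance of 9*S^2:    {statistics.variance(scaled):.3f} (theory 18)")
```

With only 100 samples the printed values will differ somewhat from the theoretical ones; increasing num_samples should bring them closer.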