Stat 401 Lab Activity 4
Wednesday, November 2, 2005
Part I: Demonstration of the correlation coefficient
In this activity we will generate and plot bivariate data having different correlation
coefficients. We will use a regression model to generate the (X,Y) values. In particular,
we will assume that the regression function of Y on X is linear
E(Y|X=x) = 2 + x.
For each value x of X we will generate a value of Y by adding error (noise) to the
regression function. Thus, for a value x of X, a value of Y is generated as
Y = 2 + x + E, where E is the noise.
We will always generate X values from the normal distribution with mean 9 and standard
deviation 2. The noise, E, will also be generated from the normal distribution with zero
mean. Different values of the variance of E will produce data sets with different
correlation coefficients. Since the regression function will always be the same, this activity
also demonstrates that the regression function, though it describes one aspect of the
relation between X and Y, is not designed to quantify the degree of dependence between
X and Y.
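To see why the noise variance controls the correlation, the population correlation under this model can be worked out directly. The derivation below is a sketch that assumes X and E are independent, which the setup implies but does not state; \sigma_X and \sigma_E are shorthand introduced here for the standard deviations of X and E.

```latex
\begin{align*}
\operatorname{Cov}(X, Y) &= \operatorname{Cov}(X,\, 2 + X + E) = \sigma_X^2, \\
\operatorname{Var}(Y) &= \sigma_X^2 + \sigma_E^2, \\
\rho(X, Y) &= \frac{\sigma_X^2}{\sigma_X \sqrt{\sigma_X^2 + \sigma_E^2}}
            = \frac{\sigma_X}{\sqrt{\sigma_X^2 + \sigma_E^2}} .
\end{align*}
```

With \sigma_X = 2, this gives population correlations of about 0.894, 0.707 and 0.555 for \sigma_E = 1, 2 and 3. Sample correlations from 50 observations will scatter around these values.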
We will generate, plot and compute the correlation coefficient of three data sets generated
with E having standard deviation of 1, 2, and 3, respectively. The Minitab command
sequences (described below only for the case of E having standard deviation 1) are:
1. Generate 50 X values, store in C1
Calc> Random Data> Normal> Generate “50” rows of data; Store in column:
“C1”, with mean =9.0 and Standard deviation =2.0 OK
2. Generate 50 E values, store in C2
Calc> Random Data> Normal> Generate “50” rows of data; Store in column:
“C2”, with mean =0.0 and Standard deviation =1.0 OK
3. Calculate the 50 Y values, store in C3
Calc>Calculator>Store result in variable: C3, Expression: 2 + C1 + C2, OK
Make a scatter plot of the 50 (X,Y) values and compute their correlation coefficient,
using the command sequences already described.
4. For homework 7, generate three sets of (X,Y) values as above, but generate the X
values from a normal distribution with mean zero and standard deviation 2, and use
the regression function
E(Y|X=x) = 2 + x^2.
Use the same three standard deviations for the noise E as before. Compare the resulting
correlation coefficients with the corresponding (i.e., same standard deviation of the noise)
correlation coefficients obtained with the linear regression function.
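For readers without Minitab, steps 1 to 4 can be imitated in a few lines of Python. This is a minimal sketch using only the standard library, not the procedure the lab prescribes. The function names (pearson_r, simulate_r, linear, quadratic) and the seeds are illustrative choices, not part of the activity.

```python
import math
import random


def pearson_r(xs, ys):
    """Sample correlation coefficient of paired observations."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def simulate_r(reg_fn, x_mean, x_sd, noise_sd, n=50, seed=None):
    """Generate n (X, Y) pairs with Y = reg_fn(X) + E; return their correlation."""
    rng = random.Random(seed)
    xs = [rng.gauss(x_mean, x_sd) for _ in range(n)]    # plays the role of C1
    es = [rng.gauss(0.0, noise_sd) for _ in range(n)]   # plays the role of C2
    ys = [reg_fn(x) + e for x, e in zip(xs, es)]        # plays the role of C3
    return pearson_r(xs, ys)


def linear(x):
    return 2 + x          # E(Y|X=x) = 2 + x


def quadratic(x):
    return 2 + x ** 2     # E(Y|X=x) = 2 + x^2 (homework 7)


if __name__ == "__main__":
    for sd in (1, 2, 3):
        r_lin = simulate_r(linear, 9.0, 2.0, sd, seed=sd)      # X ~ N(9, 2)
        r_quad = simulate_r(quadratic, 0.0, 2.0, sd, seed=sd)  # X ~ N(0, 2)
        print(f"noise sd {sd}: linear r = {r_lin:.3f}, quadratic r = {r_quad:.3f}")
```

Changing the seed argument shows how much a correlation computed from 50 points varies from one data set to the next.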
Part II: The Sampling Distribution of the Sample Mean and Sample Variance
In this activity, we generate 100 samples with sample size 10 from the normal
distribution with mean 0 and variance 1, and find the sample mean and sample variance
for each sample. This amounts to generating random numbers from the sampling
distribution of the sample mean and sample variance. Histograms and probability plots
can then be used to check the known facts about the distribution of the sample mean and
sample variance. We begin by generating random numbers from the sampling distribution
of the sample mean X̄.
1. Generate 100 samples of size 10:
Calc> Random Data> Normal> Generate “10” rows of data; Store in columns: “C1-C100”, with mean =0 and Standard deviation =1. OK.
2. Find the sample means for the 100 samples and store them in a column.
This is done in two steps. The first is:
Data> Stack> Columns> Stack the following columns: “C1-C100”; under Store
stacked data in: select Column of current worksheet and enter “Data”; under Store
subscripts in: enter “Sample”.
Thus all 100 samples, each of size 10 (so a total of 1000 observations), are “stacked”
in column C101, and C102-T records which sample the observation in
the corresponding row of C101 came from.
The second step computes the 100 sample means and stores them:
Stat> Basic Statistics> Store Descriptive Statistics> select “Data” for Variables and
“Sample” for By variables (optional); click Statistics, select “Mean”. OK, OK.
Columns C103-T, C104 and C105 then show the sample label, the sample mean X̄ of
that sample, and the sample size.
Check the normality of the sample means.
a. Histogram
b. Probability Plot.
Graph> Probability Plots> Select “Single”> Select “C104 Mean1” for Graph
variable; click Distribution> select “Normal” under distribution; 0 and 0.3162
for Mean and StDev (the standard deviation of the sample mean is 1/sqrt(10), about 0.3162). OK, OK.
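The stacking, averaging and probability-plot steps above can also be mimicked outside Minitab. The following is a rough standard-library Python sketch under stated assumptions: the function names and the seed are hypothetical, and the median-rank plotting positions (i - 0.3)/(n + 0.4) are one common convention chosen for illustration, not necessarily what a given package uses.

```python
import math
import random
from statistics import NormalDist, fmean, stdev


def sample_means(n_samples=100, size=10, mu=0.0, sigma=1.0, seed=None):
    """Draw n_samples normal samples of the given size; return their means."""
    rng = random.Random(seed)
    return [fmean([rng.gauss(mu, sigma) for _ in range(size)])
            for _ in range(n_samples)]


def normal_plot_points(values, mu, sigma):
    """Pair each ordered value with the normal quantile at its plotting position."""
    n = len(values)
    dist = NormalDist(mu, sigma)
    ordered = sorted(values)
    # Median-rank plotting positions; an assumed convention (see lead-in).
    theo = [dist.inv_cdf((i - 0.3) / (n + 0.4)) for i in range(1, n + 1)]
    return theo, ordered


if __name__ == "__main__":
    means = sample_means(seed=401)
    se = 1 / math.sqrt(10)   # theoretical standard deviation of the sample mean
    print(f"mean of sample means: {fmean(means):.4f} (theory 0)")
    print(f"sd of sample means:   {stdev(means):.4f} (theory {se:.4f})")
    theo, ordered = normal_plot_points(means, 0.0, se)
    # If the means are N(0, 1/sqrt(10)), observed and theoretical values
    # should lie close to the line observed = theoretical.
    for t, o in list(zip(theo, ordered))[:5]:
        print(f"theoretical {t:+.3f}   observed {o:+.3f}")
```

Plotting the ordered values against the theoretical quantiles with any graphing tool gives the analogue of the probability plot in item b.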
5. The Sampling Distribution of the Sample Variance.
We next generate random numbers from the sampling distribution of the sample variance.
We do this by calculating the sample variance of each of the 100 samples of size 10 and
storing them in a column. A histogram and probability plot can be used to check the facts
about the sampling distribution of the sample variance. The basic fact is that for a sample
of size n from a normal distribution,
\[ \frac{(n-1)S^2}{\sigma^2} \sim \chi^2_{n-1} \]
For samples of size n=10, which is what we have,
\[ \frac{(n-1)S^2}{\sigma^2} = \frac{(10-1)S^2}{\sigma^2} \sim \chi^2_{9} \]
Moreover, in our case the population variance equals 1.
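Two standard facts about the chi-square distribution make this result easier to check against simulated data; they are added here as a reference sketch. A chi-square variable with \nu degrees of freedom has mean \nu and variance 2\nu, and it is a gamma distribution with shape \nu/2 and scale 2. With \sigma^2 = 1 and n = 10:

```latex
\begin{align*}
9S^2 &\sim \chi^2_9 = \operatorname{Gamma}(\text{shape} = 4.5,\ \text{scale} = 2), \\
E(9S^2) &= 9, \qquad \operatorname{Var}(9S^2) = 18, \\
E(S^2) &= 1, \qquad \operatorname{Var}(S^2) = \tfrac{18}{81} = \tfrac{2}{9}.
\end{align*}
```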
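Let’s find the 100 sample variances and compare them with the chi-square distribution.
Repeat step 2 and the probability plot, with two changes:
• Stat> Basic Statistics> Store Descriptive Statistics> select “Data” for Variables and
“Sample” for By variables (optional); click Statistics, select “Variance”. OK, OK.
• Probability Plot.
Graph> Probability Plots> Select “Single”> Select “Variance2” for Graph variable;
click Distribution> select “Gamma” under distribution; 4.5 and 2 for Shape and Scale.
OK, OK.
Note that a gamma distribution with shape 4.5 and scale 2 is the chi-square distribution
with 9 degrees of freedom, which is the distribution of 9 S^2 rather than of S^2 itself. If the
plotted column holds the raw sample variances, either multiply it by 9 first
(Calc> Calculator) or use 2/9 (about 0.2222) for Scale.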
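A quick numerical check of the chi-square result, without a probability plot, is to compare the first two moments of (n - 1) S^2 / sigma^2 with those of a chi-square variable with 9 degrees of freedom (mean 9, variance 18). The sketch below uses only the Python standard library; the function name scaled_sample_variances and the seed are illustrative assumptions, and matching moments is a weaker check than the probability plot described above.

```python
import random
from statistics import fmean, variance


def scaled_sample_variances(n_samples=100, size=10, mu=0.0, sigma=1.0, seed=None):
    """Return (n - 1) * S^2 / sigma^2 for each of n_samples normal samples."""
    rng = random.Random(seed)
    out = []
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(size)]
        s2 = variance(sample)                 # sample variance, divisor n - 1
        out.append((size - 1) * s2 / sigma ** 2)
    return out


if __name__ == "__main__":
    q = scaled_sample_variances(seed=401)
    # A chi-square variable with 9 degrees of freedom has mean 9 and variance 18.
    print(f"mean of 9*S^2:     {fmean(q):.3f} (theory 9)")
    print(f"variance of 9*S^2: {variance(q):.3f} (theory 18)")
```

With only 100 samples the two summaries will differ noticeably from 9 and 18; raising n_samples shows them settling toward the theoretical values.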