Lab 2
Read in the table from "lab2-data.txt".
R: X=read.table(file="provide file location…")
Matlab: addpath('provide file location'); X=load('lab2-data.txt')
See Lab 1 if you do not remember how. There are eight columns of data, each with n=250 observations. The columns are named x1,x2,…,x8; in R you can see the names by typing names(X). dim(X) in R, or size(X) in Matlab, will return 250 and 8 as the row count and column count. X[,5] and X$x5 in R, or X(:,5) in Matlab, return the same column of data.
The data in each column was generated from a different distribution, one of: Binomial, Poisson, Uniform, Beta, Gamma, Exponential, Normal, Mixture of Normals.
1. Use column 1 data:
Make a histogram of the data. In R, hist(X[,1], main=runif(1)) draws the histogram, and plot(density(X[,1])) gives a similar plot that is sometimes a bit more helpful. In Matlab, use histogram(X(:,1)) and histfit(X(:,1)). You can also make a boxplot or compute the mean/min/max if you need them.
What is the min/max? __________________________
Continuous or discrete? _________________________
What are possible distributions? _____________________
2-8. Repeat question 1 for each of the remaining columns, and record your answers for all 8 columns in the table below.
par(mfrow=c(4,2), mar=c(3,2,1,1)) # This is optional code for R
for(i in 1:8) {hist(X[,i], main=i)}
# it produces all 8 plots at once with a "for loop"
par(mfrow=c(1,1), mar=c(4,4,4,4)) # you need this last line to reset your plot window
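No.   Min/max            Continuous/Discrete   Possible Distributions (see list above)
1     ________________   ___________________   ____________________________
2     ________________   ___________________   ____________________________
3     ________________   ___________________   ____________________________
4     ________________   ___________________   ____________________________
5     ________________   ___________________   ____________________________
6     ________________   ___________________   ____________________________
7     ________________   ___________________   ____________________________
8     ________________   ___________________   ____________________________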
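The min/max and continuous-or-discrete questions can also be checked numerically. The following is a minimal sketch in R, assuming a small simulated data frame X_demo (a hypothetical stand-in for X, with made-up columns) so that it runs without the lab file. A column whose values are all whole numbers is probably discrete, but this is only a hint and not a proof.

```r
set.seed(42)

# Hypothetical stand-in for X: one discrete and one continuous simulated column
X_demo <- data.frame(
  x1 = rpois(250, lambda = 6),
  x2 = rgamma(250, shape = 3, rate = 0.5)
)

# Row 1 holds each column's minimum, row 2 its maximum
col_range <- sapply(X_demo, range)

# TRUE when every value is a whole number, which suggests discrete data
is_whole <- sapply(X_demo, function(v) all(v == round(v)))

# Few distinct values relative to n=250 is another hint of discreteness
n_distinct <- sapply(X_demo, function(v) length(unique(v)))

print(col_range)
print(is_whole)
print(n_distinct)
```

With the real data, replace X_demo by X to get the same summaries for all eight columns at once.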
9. How was the data generated? Let's generate data: 100 data points from a Normal distribution with mean 2.3 and standard deviation 5.
R: y=rnorm(100, 2.3, 5)
Matlab: y=normrnd(2.3,5,[100,1])
Make a histogram.
Min/max __________________________________
The theoretical mean is 2.3, but what is the sample mean of the generated data?
Sample mean _____________________________
With 100 data points is the sample mean close to the theoretical mean?
____________________________________________________________
If you generate only 20 points what is the sample mean? _________________________
If you generate only 5 points what is the sample mean? __________________________
More data usually brings more accuracy to the sample mean.
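To see the effect of sample size in one run, here is a small sketch in R. The seed and the extra sample size of 10000 are assumptions added for illustration; your own numbers will differ because the data is random.

```r
set.seed(123)  # arbitrary seed, only so the run is repeatable

theoretical_mean <- 2.3
sample_sizes <- c(5, 20, 100, 10000)

# Draw one sample of each size and record its sample mean
sample_means <- sapply(sample_sizes, function(n) {
  mean(rnorm(n, mean = theoretical_mean, sd = 5))
})

# Distance between each sample mean and the theoretical mean
abs_error <- abs(sample_means - theoretical_mean)

print(data.frame(n = sample_sizes, sample_mean = sample_means, abs_error = abs_error))
```

The error does not have to shrink at every step, since each sample is random, but the spread of the sample mean is 5/sqrt(n), so large samples are rarely far from 2.3.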
10. Let's use column 4 of data. This is continuous data and could be Gamma or Normal. To run a test we need MLE parameter estimates. (The data goes where X[,4] appears.)
R:
nloglik=function(par) { -sum(log(dgamma( X[,4],par[1],par[2])))}
output=nlminb(start=c(2,2),nloglik, lower = 0, upper = Inf )
output$par
Matlab: mle(X(:,4),'distribution','gamma'). Matlab returns the second
parameter as 1/beta1 (a scale rather than a rate), so to get beta1 take the
reciprocal of what Matlab returns.
The other option in R is a package called MASS, which has a lot
of functions in it. Type install.packages("MASS"); this will walk you
through downloading. Then type library(MASS) to use this package. Now try
fitdistr(X[,4], "gamma") and also fitdistr(X[,4], "normal"). The extra
output in parentheses gives the standard errors of the estimates; we
will not use it.
What are the parameters for the Gamma distribution? alpha1=_______ beta1=_____
Now find the MLE for the parameters of the Normal distribution. (Hint: change dgamma to dnorm, or "gamma" to "normal".)
mu1=___________ and sigma1=_______________
For the Normal, the MLEs have a closed form: the sample mean, and the standard deviation computed with divisor n. R's sd() divides by n-1, so it is larger than the MLE by a factor of sqrt(n/(n-1)), which is very close to 1 for n=250.
Find them: mean(X[,4])______________ and sd(X[,4])______________
How do they compare against what the algorithm (approximate method) gave?____________ _____________
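The comparison between the numerical optimizer and the closed-form Normal estimates can be sketched as follows. This is only an illustration under stated assumptions: x is simulated data standing in for X[,4], the function name nloglik_norm is hypothetical, and the lower bound on sigma is a small positive number chosen to keep dnorm well defined.

```r
set.seed(1)
x <- rnorm(250, mean = 10, sd = 2)  # simulated stand-in for X[,4]

# Negative log-likelihood of Normal(mu = par[1], sigma = par[2])
nloglik_norm <- function(par) {
  -sum(dnorm(x, mean = par[1], sd = par[2], log = TRUE))
}

# Numerical MLE; mu is unbounded, sigma must stay positive
fit <- nlminb(start = c(mean(x), sd(x)), nloglik_norm,
              lower = c(-Inf, 1e-8), upper = c(Inf, Inf))

# Closed-form MLEs: sample mean, and standard deviation with divisor n
mle_mu <- mean(x)
mle_sigma <- sqrt(mean((x - mean(x))^2))

print(c(numeric_mu = fit$par[1], closed_mu = mle_mu))
print(c(numeric_sigma = fit$par[2], closed_sigma = mle_sigma, sd_n_minus_1 = sd(x)))
```

The numerical and closed-form values should agree to several decimal places, while sd(x) comes out slightly larger because of the n-1 divisor.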
11. A. Use the MLE estimates to make qq-plots for the Gamma and Normal distributions.
Gamma, in R (datax = TRUE because the data is on the x-axis of the qqplot):
qqplot(X[,4],rgamma(1000,alpha1,beta1), main=runif(1))
qqline(X[,4], distribution = function(p) qgamma(p, alpha1,beta1), datax = TRUE)
Gamma, in Matlab:
pd = fitdist(X(:,4),'Gamma');
qqplot(X(:,4), pd )
Normal, in R:
qqplot(X[,4],rnorm(1000,mu1,sigma1))
qqline(X[,4], distribution = function(p) qnorm(p, mu1,sigma1), datax = TRUE)
Normal, in Matlab:
pd = fitdist(X(:,4),'Normal');
qqplot(X(:,4), pd )
Which distribution is better according to the qq-plots?__________________________________________
Do you see any outliers (ie. a single point that seems out of place)?________________________________
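A qq-plot can also be built by hand from theoretical quantiles instead of a random sample, which removes the extra randomness from rgamma. The sketch below assumes simulated data and hypothetical parameter values shape_demo and rate_demo in place of X[,4], alpha1, and beta1.

```r
set.seed(7)
shape_demo <- 3    # hypothetical stand-in for alpha1
rate_demo <- 0.5   # hypothetical stand-in for beta1
x <- rgamma(250, shape = shape_demo, rate = rate_demo)  # stand-in for X[,4]

# Plotting positions, then matching theoretical and sample quantiles
probs <- ppoints(length(x))
theo_q <- qgamma(probs, shape = shape_demo, rate = rate_demo)
samp_q <- sort(x)

plot(theo_q, samp_q,
     xlab = "Theoretical Gamma quantiles", ylab = "Sample quantiles",
     main = "Gamma qq-plot (simulated data)")
abline(0, 1)  # points near this line indicate a good fit

# Correlation close to 1 means the points lie close to a straight line
qq_cor <- cor(theo_q, samp_q)
print(qq_cor)
```

Points that bend away from the line in one tail suggest the wrong distribution; a single point far from the rest suggests an outlier.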
B. Use the Kolmogorov-Smirnov Test to test the hypothesis that the data follows a
Gamma(shape=alpha1,rate=beta1) and then that it follows a Normal(mu1,sigma1).
Gamma, in R:
ks.test(X[,4], "pgamma", alpha1, beta1)
Gamma, in Matlab (Matlab's 'b' is the scale, so pass 1/beta1):
test_cdf = makedist('gamma','a',alpha1,'b',1/beta1);
[h,p] = kstest(X(:,4),'CDF',test_cdf)
Normal, in R:
ks.test(X[,4],"pnorm", mean=mu1, sd=sigma1)
Normal, in Matlab:
test_cdf = makedist('normal','mu',mu1,'sigma',sigma1);
[h,p] = kstest(X(:,4),'CDF',test_cdf)
What are the p-values from each test? What can you conclude?
Gamma:_________________________________________________________________
Normal:__________________________________________________________________
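To show how the p-values are read off the test output, here is a short sketch in R. It assumes simulated Gamma data in place of X[,4] and uses the true simulation parameters (3 and 0.5, chosen arbitrarily) where the lab uses alpha1 and beta1.

```r
set.seed(11)
x <- rgamma(250, shape = 3, rate = 0.5)  # simulated stand-in for X[,4]

# One-sample KS test against a fully specified Gamma distribution
ks_gamma <- ks.test(x, "pgamma", shape = 3, rate = 0.5)

# One-sample KS test against a Normal with the sample mean and sd plugged in
ks_norm <- ks.test(x, "pnorm", mean = mean(x), sd = sd(x))

# The p-value is stored in the p.value component of the result
print(c(gamma_p = ks_gamma$p.value, normal_p = ks_norm$p.value))
```

A small p-value (for example below 0.05) is evidence against the hypothesized distribution; a large p-value means the test found no evidence against it, which is not the same as proving the data follows it.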
12. Test the data from columns 1 and 4 to see if they come from the same distribution; this time you do not need
the MLE parameter values.
R: ks.test(X[,1],X[,4])
Matlab: [h,p]=kstest2(X(:,1), X(:,4))
What is the p-value?________________________
Are the two columns of data from the same distribution (circle one): YES NO
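The two-sample version compares the two empirical distributions directly. As a hedged illustration, the sketch below uses three simulated samples (hypothetical names a, b, and a2), not the lab columns: one pair from different distributions and one pair from the same distribution.

```r
set.seed(5)
a <- rnorm(250)            # Normal(0, 1)
b <- rexp(250, rate = 1)   # Exponential(1), clearly different from a
a2 <- rnorm(250)           # a second, independent Normal(0, 1) sample

# Two-sample KS tests: no parameters are needed, only the two samples
ks_diff <- ks.test(a, b)
ks_same <- ks.test(a, a2)

print(c(different_p = ks_diff$p.value, same_p = ks_same$p.value))
```

The first p-value should be tiny because the samples come from very different distributions. The second is usually large, although by chance it falls below 0.05 in about 5% of runs.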
13. Run Pearson's Chi-Squared Test for column 2. We want to test if this data
is Poisson(lambda=6). R has a built-in chisq.test(), but here we compute the test statistic by hand from the observed and expected counts. First we need the data in a different form: the observed frequency of seeing a 0,1,2,…
R: obsfreq=table(X[,2]) (type obsfreq to view it)
Matlab: obsfreq=tabulate(X(:,2))
Either should give the counts of each value. We also need the expected counts. This assumes the values observed in column 2 are exactly 1,2,…,13, so that obsfreq and expected have the same length.
R: expected=250*dpois(1:13,6), thus teststat=sum((obsfreq-expected)^2/expected), and to get a p-value use: 1-pchisq(teststat, 13-1)
Matlab: expected=transpose(250*poisspdf(1:13,6)), thus
teststat=sum(power(obsfreq(:,2)-expected,2)./expected), and to get a p-value use:
1 - chi2cdf(teststat, 13-1)
What is the p-value?_________________________
Is this data Poisson(6)?_________________________________________________________
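The hand computation above can be sketched end to end in R. This is an illustration under assumptions: y is simulated Poisson(6) data standing in for X[,2], and factor(..., levels = 1:13) is used so that every value from 1 to 13 gets a count (possibly zero) and the observed and expected vectors line up. Values outside 1 to 13 are dropped, following the handout's choice of categories.

```r
set.seed(99)
n <- 250
y <- rpois(n, lambda = 6)  # simulated stand-in for X[,2]

values <- 1:13  # categories used in the handout

# Observed counts for each value 1..13, including zero counts
obsfreq <- as.vector(table(factor(y, levels = values)))

# Expected counts under Poisson(6)
expected <- n * dpois(values, lambda = 6)

# Pearson chi-squared statistic and its p-value with 13-1 degrees of freedom
teststat <- sum((obsfreq - expected)^2 / expected)
p_value <- 1 - pchisq(teststat, df = length(values) - 1)

print(c(teststat = teststat, p_value = p_value))
```

A large p-value means the counts are consistent with Poisson(6); a small one is evidence against it.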
Note: In R there is a library called nortest that contains 5 tests for the Normal distribution alone: sf.test(), ad.test(),
cvm.test(), lillie.test(), pearson.test(). So there are many more options than just the ks.test()