Lab 2

Read in the table from "lab2-data.txt":

    R:      X=read.table(file="provide file location…")
    Matlab: addpath('provide file location'); X=load('lab2-data.txt')

See lab 1 if you do not remember.

There are eight columns of data, each with n=250 observations. The columns are named x1,x2,…,x8, which can be seen by typing names(X) (R only; no Matlab equivalent). dim(X) in R or size(X) in Matlab will return 250 and 8 as the row count and column count. X[,5] and X$x5 in R, or X(:,5) in Matlab, return the same column of data. The data in each column was generated from a different distribution, i.e. Binomial, Poisson, Uniform, Beta, Gamma, Exponential, Normal, Mixture of Normals.

1. Use column 1 data: Make a histogram of the data, i.e. hist(X[,1], main=runif(1)). plot(density(X[,1])) will give a similar plot that is sometimes a bit more helpful. In Matlab use histogram(X(:,1)) and histfit(X(:,1)). If you need a boxplot or mean/min/max, you can do that.

What is the min/max? __________________________
Continuous or discrete? _________________________
What are possible distributions? _____________________

2-8. Answer #1 for all 8 columns.

    par(mfrow=c(4,2), mar=c(3,2,1,1))      # This is optional code for R
    for(i in 1:8) {hist(X[,i], main=i)}    # it will produce the 8 plots all at once with a "for loop"
    par(mfrow=c(1,1), mar=c(4,4,4,4))      # you will need this last line of code to reset your plot window

No.                       1      2      3      4      5      6      7      8
Min/max                   ____   ____   ____   ____   ____   ____   ____   ____
Continuous/Discrete       ____   ____   ____   ____   ____   ____   ____   ____
Possible Distributions    ____   ____   ____   ____   ____   ____   ____   ____
(see the list of distributions above)

9. How was the data generated? Let's generate data: how about 100 data points from a Normal distribution with mean 2.3 and standard deviation 5.

    R:      y=rnorm(100, 2.3, 5)
    Matlab: y=normrnd(2.3,5,[100,1])

Histogram. Min/max __________________________________
The theoretical mean is 2.3, but what is the sample mean of the collected data? Sample mean _____________________________
With 100 data points is the sample mean close to the theoretical mean? ____________________________________________________________
If you generate only 20 points what is the sample mean?
_________________________
If you generate only 5 points what is the sample mean? __________________________
More data usually brings more accuracy to the sample mean.

10. Let's use column 4 of data. This is continuous data and could be Gamma or Normal. To run a test we need MLE parameter estimates. (The data goes where X[,4] or X(:,4) appears.)

    R:
    nloglik=function(par) { -sum(log(dgamma(X[,4],par[1],par[2]))) }
    output=nlminb(start=c(2,2), nloglik, lower=0, upper=Inf)
    output$par

    Matlab:
    mle(X(:,4),'distribution','gamma')

In Matlab the second parameter is returned as 1/beta1, so to get beta1 take the reciprocal of what Matlab returns.

The other option in R is to download a package called MASS, which has a lot of functions in it. Type install.packages("MASS"); this will walk you through downloading. Then type library(MASS) to use this package. Now try fitdistr(X[,4], "gamma") and also try fitdistr(X[,4], "normal"). The extra parenthetical output is information about how good the fitted values are; we will not use it.

What are the parameters for the Gamma distribution? alpha1=_______ beta1=_____
Now find the MLE for the parameters for the Normal distribution. (Hint: change the dgamma to dnorm, or the gamma to normal.) mu1=___________ and sigma1=_______________
It happens that the MLE for the Normal are the sample mean and standard deviation. Find them: mean(X[,4])______________ and sd(X[,4])______________
How do they compare against what the algorithm (approximate method) gave? ____________ _____________

11. A. Use the MLE estimates to make qq-plots for the Gamma and Normal distributions.

    Gamma, R:
    qqplot(X[,4],rgamma(1000,alpha1,beta1), main=runif(1))
    qqline(X[,4], distribution = function(p) qgamma(p, alpha1,beta1))
    Gamma, Matlab:
    pd = fitdist(X(:,4),'Gamma'); qqplot(X(:,4), pd)

    Normal, R:
    qqplot(X[,4],rnorm(1000,mu1,sigma1))
    qqline(X[,4], distribution = function(p) qnorm(p, mu1,sigma1))
    Normal, Matlab:
    pd = fitdist(X(:,4),'Normal'); qqplot(X(:,4), pd)

Which distribution is better according to the qq-plots? __________________________________________
Do you see any outliers (i.e. a single point that seems out of place)? ________________________________

B. Use the Kolmogorov-Smirnov Test to test the hypothesis that the data follows a Gamma(shape=alpha1,rate=beta1) and then that it follows a Normal(mu1,sigma1). Matlab's Gamma parameter B is the scale, so it takes 1/beta1.

    Gamma, R:
    ks.test(X[,4], "pgamma", alpha1, beta1)
    Gamma, Matlab:
    test_cdf = makedist('gamma','A',alpha1,'B',1/beta1); [h,p] = kstest(X(:,4),'CDF',test_cdf)

    Normal, R:
    ks.test(X[,4], "pnorm", mean=mu1, sd=sigma1)
    Normal, Matlab:
    test_cdf = makedist('normal','mu',mu1,'sigma',sigma1); [h,p] = kstest(X(:,4),'CDF',test_cdf)

What are the p-values from each test? What can you conclude?
Gamma:_________________________________________________________________
Normal:__________________________________________________________________

12. Test the data from columns 1 and 4 to see if they come from the same distribution; this time you do not need the MLE parameter values:

    R:      ks.test(X[,1],X[,4])
    Matlab: [h,p]=kstest2(X(:,1), X(:,4))

What is the p-value? ________________________
Are the two columns of data from the same distribution (circle one): YES NO

13. Run Pearson's Chi-Squared Test (chisq.test() in R) for column 2. We want to test if this data is Poisson(lambda=6). First we need the data in a different form: the observed frequency of seeing a 0,1,2,…. In R use obsfreq=table(X[,2]) (type obsfreq to view it); in Matlab use obsfreq=tabulate(X(:,2)). Either should give the counts of each value.
We also need the expected counts, a test statistic, and a p-value.

    R:
    expected=250*dpois(1:13,6)
    teststat=sum((obsfreq-expected)^2/expected)
    1-pchisq(teststat, 13-1)

    Matlab:
    expected=transpose(250*poisspdf(1:13,6))
    teststat=sum(power(obsfreq(:,2)-expected,2)./expected)
    1 - chi2cdf(teststat, 13-1)

What is the p-value? _________________________
Is this data Poisson(6)? _________________________________________________________

Note: In R there is a library called nortest that contains 5 tests for the Normal distribution alone: sf.test(), ad.test(), cvm.test(), lillie.test(), pearson.test(). So there are many more options than just ks.test().
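
The hand calculation in 13 only lines up if the observed table has exactly one count for each of the values 1 to 13. As a minimal sketch of one way to guard against that, the R code below fixes the bins ahead of time and pools everything at 13 or above into a single bin, so the expected probabilities sum to 1. The data here is simulated as a stand-in for X[,2], and the pooling cutoff k_max is an assumption of this sketch, not part of the lab.

```r
# Simulated stand-in for X[,2]; replace x with X[,2] to use the lab data.
set.seed(1)
x <- rpois(250, 6)

lambda <- 6
k_max <- 13   # assumed cutoff: the last bin collects every value >= 13

# Observed counts for 0, 1, ..., 12 and the pooled ">= 13" bin.
# factor() with explicit levels keeps values that never occur as zero counts.
obsfreq <- as.vector(table(factor(pmin(x, k_max), levels = 0:k_max)))

# Poisson probabilities for 0..12 plus the upper tail P(X >= 13).
probs <- c(dpois(0:(k_max - 1), lambda),
           ppois(k_max - 1, lambda, lower.tail = FALSE))
expected <- length(x) * probs

# Same statistic as in step 13, now over 14 bins.
teststat <- sum((obsfreq - expected)^2 / expected)
df <- length(obsfreq) - 1   # no parameters were estimated from the data
pvalue <- 1 - pchisq(teststat, df)

# Built-in goodness-of-fit form: counts first, then probabilities via p=.
# R may warn here because some expected counts are small.
builtin <- chisq.test(obsfreq, p = probs)
```

Printing teststat and pvalue next to builtin lets you check that the hand calculation and chisq.test() agree. Bins with very small expected counts make the chi-squared approximation rough, so pooling more of the low and high values is a reasonable variation.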