Homework 2
Due date: 02/10/2004

Problem 1: Suppose somebody has both a die and a coin. Each time, he chooses either to roll the die or to flip the coin, and the choice follows a Bernoulli distribution. The observed number for flipping the coin is either 0 or 1, and the observed number for rolling the die is 0, 1, 2, 3, 4, or 5. By repeating this experiment m times, you observe a sequence of numbers (n_1, n_2, ..., n_m). The observation data can be found at http://www.cse.msu.edu/~cse847/assignments/mixture_data.txt.

a) Assume that each experiment is independent of the previous ones, and that both the coin and the die are fair, with equal probability for each possible number. Using maximum likelihood estimation, compute the probability p of the Bernoulli distribution that fits the observations.

b) Still assume that each experiment is independent. However, now we only know that the die is fair, not the coin. Taking into account that the coin may be biased, re-compute your answer for the probability p of the Bernoulli distribution. You also need to estimate the bias of the coin, i.e., the probability of observing 1 when the coin is flipped.

Problem 2:

1. Build a k-nearest-neighbor classifier that uses the leave-one-out approach to determine the appropriate value of k. The training data can be found in the file http://www.cse.msu.edu/~cse847/assignments/spam.train.txt. Each row in the file corresponds to a training data point; the last attribute of each data point is the class label (either +1 or -1), and the remaining attributes are the input features. Submit your code and the best k value.

2. In the same directory, you will find the test data in the file http://www.cse.msu.edu/~cse847/assignments/spam.test.txt. The format of the test file is the same as that of the training file, except that it has no class labels. Apply your k-nearest-neighbor model to predict class labels for the test data.
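For Problem 1(a), the mixture log-likelihood can be maximized numerically over a grid of candidate p values. The sketch below is a minimal illustration, not the expected submission: it uses synthetic data with an assumed true p in place of the course file mixture_data.txt, and the grid resolution is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for mixture_data.txt: with probability p_true flip a
# fair coin (0 or 1), otherwise roll a fair die (0..5). p_true is assumed.
p_true = 0.6
m = 5000
coin = rng.random(m) < p_true
data = np.where(coin, rng.integers(0, 2, m), rng.integers(0, 6, m))

def log_likelihood(p, data):
    # Mixture density: P(n) = p*(1/2) + (1-p)*(1/6) for n in {0, 1},
    # and P(n) = (1-p)*(1/6) for n in {2, 3, 4, 5}.
    probs = np.where(data <= 1, p / 2 + (1 - p) / 6, (1 - p) / 6)
    return np.log(probs).sum()

# Maximize the log-likelihood over a fine grid of candidate p values.
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax([log_likelihood(p, data) for p in grid])]
print(p_hat)
```

On the real data the same loop applies after loading the file; the one-dimensional problem also admits a closed-form stationary point, which the grid search should agree with.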
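For Problem 2, the leave-one-out selection of k can be organized as: for each candidate k, classify every training point by its k nearest neighbors among the remaining points, and keep the k with the highest accuracy. A minimal sketch, using synthetic two-cluster data in place of spam.train.txt (the data, the Euclidean metric, and the odd-k candidate range are assumptions):

```python
import numpy as np

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of k-NN: classify each point by majority
    vote among its k nearest neighbors, excluding the point itself."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # leave the point itself out
    nn = np.argsort(dists, axis=1)[:, :k]    # indices of k nearest neighbors
    preds = np.sign(y[nn].sum(axis=1))       # majority vote over labels in {-1, +1}
    return (preds == y).mean()

# Synthetic stand-in for the spam training data: two Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

# Pick the k (odd, to avoid voting ties) with the best leave-one-out accuracy.
best_k = max(range(1, 30, 2), key=lambda k: loo_accuracy(X, y, k))
print(best_k)
```

The full pairwise distance matrix is fine at this scale; for a larger training set the distances can be computed one query point at a time instead.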
Submit your prediction results on the test data in the format where each row is the predicted class label for one test data point. The grade is based on the classification accuracy of your kNN model on the test data.

Problem 3: Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the nearest-neighbor estimate at the origin. Prove that the median distance from the origin to the closest data point is

    d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}

Furthermore, show that when N \gg 1, p \gg 1, and p \gg \log N, the above expression simplifies to

    d(p, N) \approx 1 - \frac{\log N}{p}

Problem 4: In Lecture 5, we discussed how to apply a Gaussian generative model to the problem of predicting the gender of an individual given his/her height. In that problem, we claimed that given training data h_1^m, h_2^m, ..., h_{N_m}^m; h_1^f, h_2^f, ..., h_{N_f}^f, the maximum likelihood solution is:

    \mu_m = \frac{1}{N_m}\sum_{i=1}^{N_m} h_i^m, \qquad \mu_f = \frac{1}{N_f}\sum_{i=1}^{N_f} h_i^f,

    \sigma_m^2 = \frac{1}{N_m}\sum_{i=1}^{N_m} (h_i^m - \mu_m)^2, \qquad \sigma_f^2 = \frac{1}{N_f}\sum_{i=1}^{N_f} (h_i^f - \mu_f)^2,

    p_m = \frac{N_m}{N}, \qquad p_f = \frac{N_f}{N}

In the above, h_i^m represents the height of a man, and h_i^f represents the height of a woman. N_m and N_f stand for the numbers of men and women, respectively, and N is the total number of people in this sample. \mu_m, \sigma_m^2, \mu_f, \sigma_f^2 are the means and variances for men and women, respectively, and p_m and p_f are the class priors for men and women. In this homework, you need to show that the above solution does maximize the likelihood of the training data.

Problem 5: Build and test a Naïve Bayes classifier using the same training and test data as described in Problem 2. Assume that each attribute of the input data follows a Gaussian distribution. Submit your code for training and testing the Naïve Bayes classifier, together with your classification results on the test data.
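For Problem 5, training a Gaussian Naïve Bayes classifier amounts to estimating, for each class, a prior plus a per-attribute mean and variance, and prediction picks the class with the largest log posterior. A minimal sketch on synthetic data (the data, the variance floor of 1e-9, and all shapes are assumptions, not the course files):

```python
import numpy as np

def fit_gnb(X, y):
    """Per class: estimate the prior and per-attribute means/variances."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Small variance floor guards against zero-variance attributes.
        stats[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return stats

def predict_gnb(stats, X):
    """Pick the class maximizing log prior + sum of Gaussian log-densities."""
    classes = sorted(stats)
    scores = []
    for c in classes:
        prior, mu, var = stats[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(scores, axis=0)]

# Synthetic stand-in for spam.train.txt: two Gaussian classes, labels in {-1, +1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(2, 1, (200, 3))])
y = np.array([-1] * 200 + [+1] * 200)
stats = fit_gnb(X, y)
acc = (predict_gnb(stats, X) == y).mean()
print(acc)
```

Working in log space avoids underflow when the products of many per-attribute densities become tiny, which matters for the higher-dimensional spam data.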