Homework 2
Due date: 02/10/2004

Problem 1: Suppose somebody has both a die and a coin. Each time, he chooses either to roll the die or to flip the coin, and the choice follows a Bernoulli distribution. The observed number for flipping the coin is either 0 or 1, and the observed number for rolling the die is 0, 1, 2, 3, 4, or 5. By repeating this experiment m times, you observe a sequence of numbers (n_1, n_2, ..., n_m). The observation data can be found at http://www.cse.msu.edu/~cse847/assignments/mixture_data.txt.

a) Assume that each experiment is independent of the previous ones, and that both the coin and the die are fair, with equal probability for each possible number. Using maximum likelihood estimation, compute the probability p of the Bernoulli distribution that fits the observations.

b) Still assume that each experiment is independent. However, now we only know that the die is fair, not the coin. Taking into account that the coin may be biased, re-compute your answer for the probability p of the Bernoulli distribution. You also need to estimate the bias of the coin, i.e., the probability of observing 1 when the coin is flipped.

Problem 2:

1. Build a k-nearest-neighbor classifier that uses the leave-one-out approach to determine the appropriate value of k. The training data can be found in the file http://www.cse.msu.edu/~cse847/assignments/spam.train.txt. Each row in the file corresponds to a training data point; the last attribute of each data point is the class label (either +1 or -1), and the remaining attributes are the input features. Submit your code and the best k value.

2. In the same directory, you will find the test data in the file http://www.cse.msu.edu/~cse847/assignments/spam.test.txt. The format of the test file is the same as that of the training file, except that it has no class labels. Apply your k-nearest-neighbor model to predict class labels for the test data.
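For Problem 1(a), the mixture log-likelihood can be maximized numerically over a grid of candidate p values. The sketch below is a minimal illustration, not the expected submission: it uses synthetic data with an assumed true p in place of the course file mixture_data.txt, and the grid resolution is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for mixture_data.txt: with probability p_true flip a
# fair coin (0 or 1), otherwise roll a fair die (0..5). p_true is assumed.
p_true = 0.6
m = 5000
coin = rng.random(m) < p_true
data = np.where(coin, rng.integers(0, 2, m), rng.integers(0, 6, m))

def log_likelihood(p, data):
    # Mixture density: P(n) = p*(1/2) + (1-p)*(1/6) for n in {0, 1},
    # and P(n) = (1-p)*(1/6) for n in {2, 3, 4, 5}.
    probs = np.where(data <= 1, p / 2 + (1 - p) / 6, (1 - p) / 6)
    return np.log(probs).sum()

# Maximize the log-likelihood over a fine grid of candidate p values.
grid = np.linspace(0.001, 0.999, 999)
p_hat = grid[np.argmax([log_likelihood(p, data) for p in grid])]
print(p_hat)
```

On the real data the same loop applies after loading the file; the one-dimensional problem also admits a closed-form stationary point, which the grid search should agree with.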
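For Problem 2, the leave-one-out selection of k can be organized as: for each candidate k, classify every training point by its k nearest neighbors among the remaining points, and keep the k with the highest accuracy. A minimal sketch, using synthetic two-cluster data in place of spam.train.txt (the data, the Euclidean metric, and the odd-k candidate range are assumptions):

```python
import numpy as np

def loo_accuracy(X, y, k):
    """Leave-one-out accuracy of k-NN: classify each point by majority
    vote among its k nearest neighbors, excluding the point itself."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)          # leave the point itself out
    nn = np.argsort(dists, axis=1)[:, :k]    # indices of k nearest neighbors
    preds = np.sign(y[nn].sum(axis=1))       # majority vote over labels in {-1, +1}
    return (preds == y).mean()

# Synthetic stand-in for the spam training data: two Gaussian clusters.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([-1] * 100 + [+1] * 100)

# Pick the k (odd, to avoid voting ties) with the best leave-one-out accuracy.
best_k = max(range(1, 30, 2), key=lambda k: loo_accuracy(X, y, k))
print(best_k)
```

The full pairwise distance matrix is fine at this scale; for a larger training set the distances can be computed one query point at a time instead.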
Submit your prediction results on the test data in the format where each row is the predicted class label for one test data point. The grade is based on the classification accuracy of your kNN model on the test data.

Problem 3: Consider N data points uniformly distributed in a p-dimensional unit ball centered at the origin, and consider the nearest-neighbor estimate at the origin. Prove that the median distance from the origin to the closest data point is

    d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p}

Furthermore, show that when N \gg 1, p \gg 1, and p \gg \log N, the above expression simplifies to

    d(p, N) \approx 1 - \frac{\log N}{p}

Problem 4: In Lecture 5, we discussed how to apply a Gaussian generative model to the problem of predicting the gender of an individual given his/her height. In that problem, we claimed that given training data h_1^m, h_2^m, ..., h_{N_m}^m; h_1^f, h_2^f, ..., h_{N_f}^f, the maximum likelihood solution is:

    \mu_m = \frac{1}{N_m}\sum_{i=1}^{N_m} h_i^m, \qquad \mu_f = \frac{1}{N_f}\sum_{i=1}^{N_f} h_i^f,

    \sigma_m^2 = \frac{1}{N_m}\sum_{i=1}^{N_m} (h_i^m - \mu_m)^2, \qquad \sigma_f^2 = \frac{1}{N_f}\sum_{i=1}^{N_f} (h_i^f - \mu_f)^2,

    p_m = \frac{N_m}{N}, \qquad p_f = \frac{N_f}{N}

In the above, h_i^m represents the height of a man, and h_i^f represents the height of a woman. N_m and N_f stand for the numbers of men and women, respectively, and N is the total number of people in this sample. \mu_m, \sigma_m^2, \mu_f, \sigma_f^2 are the means and variances for men and women, respectively, and p_m and p_f are the class priors for men and women. In this homework, you need to show that the above solution does maximize the likelihood of the training data.

Problem 5: Build and test a Naïve Bayes classifier using the same training and test data as described in Problem 2. Assume that each attribute of the input data follows a Gaussian distribution. Submit your code for training and testing the Naïve Bayes classifier, together with your classification results on the test data.
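For Problem 5, training a Gaussian Naïve Bayes classifier amounts to estimating, for each class, a prior plus a per-attribute mean and variance, and prediction picks the class with the largest log posterior. A minimal sketch on synthetic data (the data, the variance floor of 1e-9, and all shapes are assumptions, not the course files):

```python
import numpy as np

def fit_gnb(X, y):
    """Per class: estimate the prior and per-attribute means/variances."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        # Small variance floor guards against zero-variance attributes.
        stats[c] = (len(Xc) / len(X), Xc.mean(axis=0), Xc.var(axis=0) + 1e-9)
    return stats

def predict_gnb(stats, X):
    """Pick the class maximizing log prior + sum of Gaussian log-densities."""
    classes = sorted(stats)
    scores = []
    for c in classes:
        prior, mu, var = stats[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(scores, axis=0)]

# Synthetic stand-in for spam.train.txt: two Gaussian classes, labels in {-1, +1}.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0, 1, (200, 3)), rng.normal(2, 1, (200, 3))])
y = np.array([-1] * 200 + [+1] * 200)
stats = fit_gnb(X, y)
acc = (predict_gnb(stats, X) == y).mean()
print(acc)
```

Working in log space avoids underflow when the products of many per-attribute densities become tiny, which matters for the higher-dimensional spam data.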