Homework 2
Due date: 02/10/2004
Problem 1: Suppose that somebody has both a die and a coin. Each time, he can choose
to roll the die or to flip the coin, and the choice follows a Bernoulli distribution. The
observed number for flipping the coin is either 0 or 1, and the observed number for
rolling the die is 0, 1, 2, 3, 4, or 5. By repeating this experiment m times, you
will observe a sequence of numbers (n1, n2, …, nm). The observation data can be found
at http://www.cse.msu.edu/~cse847/assignments/mixture_data.txt.
a) Assume that each experiment is independent of the previous ones, and that both
the coin and the die are fair, with equal probability for each possible number.
Using maximum likelihood estimation, compute the probability p of the Bernoulli
distribution that best fits the observations.
b) Still assume that each experiment is independent. However, now we only know that the die
is fair, but not the coin. Taking into account that the coin can be biased, re-compute
your answer for the probability p of the Bernoulli distribution. You also need to estimate
the bias of the coin, which is the probability of observing 1 when flipping the coin.
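As a point of reference, part (b) can be approached with expectation-maximization. The sketch below is one possible approach, not the required solution; it uses a small synthetic sample in place of the assignment's data file, and the true values p=0.4 and q=0.7 used to generate that sample are illustrative assumptions.

```python
import numpy as np

# Synthetic stand-in for mixture_data.txt: with probability p=0.4 flip a
# biased coin (shows 1 with probability q=0.7), otherwise roll a fair die.
rng = np.random.default_rng(0)
n = np.where(rng.random(1000) < 0.4,
             (rng.random(1000) < 0.7).astype(int),
             rng.integers(0, 6, 1000))

p, q = 0.5, 0.5                       # initial guesses for mixing prob. and coin bias
for _ in range(200):
    # E-step: posterior probability that each observation came from the coin.
    # The coin can only produce 0 or 1; any n >= 2 must come from the die.
    coin_lik = np.where(n <= 1, np.where(n == 1, q, 1 - q), 0.0)
    die_lik = 1.0 / 6.0
    w = p * coin_lik / (p * coin_lik + (1 - p) * die_lik)
    # M-step: re-estimate the mixing probability p and the coin bias q.
    p = w.mean()
    q = (w * n).sum() / w.sum()       # valid because coin outcomes are 0/1
print(p, q)
```

With enough data the estimates should land near the generating values; observations of 2 through 5 unambiguously identify die rolls, which anchors the estimate of p.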
Problem 2:
1. Build a k-nearest-neighbor classifier that uses the leave-one-out approach to
determine the appropriate value of k. The training data can be found in the file
http://www.cse.msu.edu/~cse847/assignments/spam.train.txt. Each row in the file
corresponds to a training data point. The last attribute of each data point is the class label
(either +1 or -1) and the remaining attributes are the input features. Submit your code and
the best k value.
2. Under the same directory, you will find the test data in the file
http://www.cse.msu.edu/~cse847/assignments/spam.test.txt. The format of the test file is
the same as that of the training file, except that it does not have class labels. Apply your
k-nearest-neighbor model to predict class labels for the test data. Submit your prediction
results on the test data in the format of one row per test data point, each row containing
the predicted class label. The grade is based on the classification accuracy of your kNN
model on the test data.
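The leave-one-out selection of k can be sketched as follows. This is a minimal illustration on synthetic data, not the submission itself; loading the actual spam.train.txt/spam.test.txt files (and their whitespace-separated format) is left as an assumption for the reader to adapt.

```python
import numpy as np

def knn_predict(train_x, train_y, x, k):
    """Majority vote among the k nearest training points to x (labels +1/-1)."""
    d = np.linalg.norm(train_x - x, axis=1)
    nearest = np.argsort(d)[:k]
    return 1 if train_y[nearest].sum() >= 0 else -1

def best_k_loo(train_x, train_y, k_values):
    """Leave-one-out: classify each point using the remaining N-1 points,
    and return the k with the fewest mistakes."""
    best, best_err = None, float("inf")
    n = len(train_y)
    for k in k_values:
        err = 0
        for i in range(n):
            mask = np.arange(n) != i
            if knn_predict(train_x[mask], train_y[mask], train_x[i], k) != train_y[i]:
                err += 1
        if err < best_err:
            best, best_err = k, err
    return best

# Tiny synthetic example (the real data lives at the URLs above):
rng = np.random.default_rng(1)
x = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)
k = best_k_loo(x, y, [1, 3, 5])
print(k)
```

Odd values of k avoid voting ties with two classes, which is why the candidate list above contains only odd numbers.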
Problem 3: Consider N data points uniformly distributed in a p-dimensional unit ball
centered at the origin. Consider the nearest-neighbor estimate at the origin. Prove that
the mean distance from the origin to the closest data point is:

$$d(p, N) = \left(1 - \left(\frac{1}{2}\right)^{1/N}\right)^{1/p}$$

Furthermore, show that the above expression can be simplified as
$d(p, N) \sim 1 - \frac{\log N}{p}$ when $N \gg 1$, $p \gg 1$, and $p \gg \log N$.
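One possible starting point (a sketch of the setup, not the full proof): the volume fraction of the unit ball within radius r is $r^p$, and the stated formula is the distance at which the survival probability of the nearest-neighbor distance equals 1/2.

```latex
% Each point independently satisfies P(\|X\| > r) = 1 - r^p, so for the
% nearest of N independent points:
P(R > r) = \left(1 - r^p\right)^N .
% Setting this probability to 1/2 and solving for r gives the stated distance:
\left(1 - d^p\right)^N = \tfrac{1}{2}
\;\Longrightarrow\;
d(p, N) = \left(1 - \left(\tfrac{1}{2}\right)^{1/N}\right)^{1/p} .
% For the asymptotics: (1/2)^{1/N} = e^{-\ln 2 / N} \approx 1 - \ln 2 / N,
% hence
d(p, N) \approx \left(\tfrac{\ln 2}{N}\right)^{1/p}
        = e^{(\ln \ln 2 - \ln N)/p}
        \approx 1 - \frac{\ln N}{p}
% when N \gg 1, p \gg 1, and p \gg \log N, since \ln N then dominates \ln \ln 2.
```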
Problem 4: In lecture 5, we discussed how to apply a Gaussian generative model to the
problem of predicting the gender of an individual given his/her height. In that problem,




we claimed that given training data $\{h_1^m, h_2^m, \ldots, h_{N_m}^m; h_1^f, h_2^f, \ldots, h_{N_f}^f\}$
with $N = N_m + N_f$, the maximum likelihood solution is:

$$\mu_m = \frac{\sum_{i=1}^{N_m} h_i^m}{N_m}, \qquad \mu_f = \frac{\sum_{i=1}^{N_f} h_i^f}{N_f},$$

$$\sigma_m^2 = \frac{\sum_{i=1}^{N_m} (h_i^m - \mu_m)^2}{N_m}, \qquad \sigma_f^2 = \frac{\sum_{i=1}^{N_f} (h_i^f - \mu_f)^2}{N_f},$$

$$p_m = \frac{N_m}{N}, \qquad p_f = \frac{N_f}{N}$$
In the above, $h_i^m$ represents the height of a man, and $h_i^f$ represents the height of a
woman. $N_m$ and $N_f$ stand for the number of males and females, respectively. $N$ is the
total number of people in this sample. $\mu_m, \sigma_m^2, \mu_f, \sigma_f^2$ are the means and variances for men
and women, respectively. $p_m$ and $p_f$ are the class priors for men and women,
respectively. In this homework, you need to show that the above solution does maximize
the likelihood of the training data.
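As a hint toward the required proof, the log-likelihood separates into per-class terms plus a term for the priors; the sketch below shows the mean for the male class, with the remaining parameters following the same pattern.

```latex
% Log-likelihood of the full sample, with the priors constrained by p_m + p_f = 1:
\ell = \sum_{i=1}^{N_m} \left[ \log p_m + \log \mathcal{N}(h_i^m; \mu_m, \sigma_m^2) \right]
     + \sum_{i=1}^{N_f} \left[ \log p_f + \log \mathcal{N}(h_i^f; \mu_f, \sigma_f^2) \right] .
% Setting the derivative with respect to \mu_m to zero:
\frac{\partial \ell}{\partial \mu_m}
  = \sum_{i=1}^{N_m} \frac{h_i^m - \mu_m}{\sigma_m^2} = 0
\;\Longrightarrow\;
\mu_m = \frac{\sum_{i=1}^{N_m} h_i^m}{N_m} .
% The variances follow from \partial \ell / \partial \sigma_m^2 = 0, and the
% priors from maximizing N_m \log p_m + N_f \log(1 - p_m) over p_m.
```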
Problem 5: Build and test a Naïve Bayes classifier using the same training and testing
data as described in Problem 2. Assume each attribute of the input data follows a
Gaussian distribution. Submit your code for training and testing the Naïve Bayes
classifier and your classification results on the test data.
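A Gaussian Naïve Bayes classifier of the kind required can be sketched as below. This is a hedged illustration on synthetic data, not the submission; adapting it to load the spam files from Problem 2 (whitespace-separated values, label in the last column) is an assumption left to the reader.

```python
import numpy as np

class GaussianNB:
    """Naive Bayes with one independent Gaussian per attribute and class."""

    def fit(self, x, y):
        self.classes = np.unique(y)
        self.prior = {c: np.mean(y == c) for c in self.classes}
        self.mu = {c: x[y == c].mean(axis=0) for c in self.classes}
        self.var = {c: x[y == c].var(axis=0) + 1e-9 for c in self.classes}  # avoid /0
        return self

    def predict(self, x):
        scores = []
        for c in self.classes:
            # log prior + sum of per-feature Gaussian log-densities
            log_lik = -0.5 * (np.log(2 * np.pi * self.var[c])
                              + (x - self.mu[c]) ** 2 / self.var[c]).sum(axis=1)
            scores.append(np.log(self.prior[c]) + log_lik)
        return self.classes[np.argmax(scores, axis=0)]

# Tiny synthetic check (the real data comes from the URLs in Problem 2):
rng = np.random.default_rng(2)
x = np.vstack([rng.normal(0, 1, (50, 3)), rng.normal(4, 1, (50, 3))])
y = np.array([-1] * 50 + [1] * 50)
pred = GaussianNB().fit(x, y).predict(x)
print((pred == y).mean())
```

Working in log space avoids underflow when many attributes are multiplied together, which matters for the high-dimensional spam data.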