CPSC445/545 Introduction to Data Mining Spring 2008
Homework 3 (Due: Thursday, March 6, 2008)
The following exercise can be easily done with the information provided in most of the
online R tutorials mentioned on http://www.r-project.org and covered in the second
lecture given by Jiang Du.
If you run into (unexpected) difficulties, it is fine to consult with one of your classmates
and/or Jiang. Please send your solutions to Jiang ([email protected]) by March 6, 2008.
The file hw3_data.csv contains 1000 observations with two groups (Group 0 and Group
1) and two variables (x and y). You can load the data using the "read.csv" command in R.
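A minimal sketch of loading and inspecting the data follows. The column names group, x, and y are assumptions, since the assignment does not state the file's header row. So that the sketch runs without the real file, it first writes a small synthetic stand-in to a temporary file; for the homework, the path would simply be "hw3_data.csv".

```r
# Assumption: the CSV has a header row with columns named group, x, and y.
# The real file is hw3_data.csv; a synthetic stand-in is written to a
# temporary file here so the sketch runs on its own.
set.seed(1)
demo <- data.frame(group = rep(c(0, 1), each = 500),
                   x = rnorm(1000),
                   y = rnorm(1000))
path <- tempfile(fileext = ".csv")
write.csv(demo, path, row.names = FALSE)

dat <- read.csv(path)       # for the homework: read.csv("hw3_data.csv")
str(dat)                    # column names and types
table(dat$group)            # number of observations in each group
```

Checking str() and table() before plotting is a cheap way to confirm that the group label was read as expected and that both groups are present.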
1. Plot all the data points in a 2-dimensional scatter plot. Mark Group 1 points and
Group 0 points differently (e.g., one with a 'x' and the other with 'o') so you can
visualize the distribution of the points of each Group.
2. Partition the data into training and classification sets with 600 and 400
observations respectively. How is this best done in an unbiased way?
3. Using R or Weka, compare the performance of:
o Logistic Regression
o Support Vector Machine (SVM)
o Neural Nets
o K-Nearest Neighbor Classifiers
How would you rank these methods and why?
Remember that logistic regression and SVM analysis (with a linear kernel) are linear
classifiers, i.e., they separate points of different classes with a hyperplane (a straight
line when there are two variables). In contrast, neural networks and k-nearest neighbors
allow non-linear classifiers (do you have an intuitive idea of the geometry of how the
latter two classify points?).
4. For each method, draw a scatter plot for the best classifier. On each plot, display
the following series:
o Group 0 points that are classified correctly,
o Group 0 points that are misclassified,
o Group 1 points that are classified correctly,
o Group 1 points that are misclassified.
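
The workflow in items 2 to 4 can be sketched in R as follows. This is an illustration under stated assumptions, not a model solution: the data are a synthetic stand-in with assumed column names group, x, and y; only two of the four methods are shown (logistic regression via glm and k-nearest neighbors via knn from the class package); k = 5 is an arbitrary choice; and plot_series is a hypothetical helper name. SVM and neural net fits would follow the same pattern with, for example, svm() from the e1071 package and nnet() from the nnet package.

```r
library(class)  # knn(); a recommended package shipped with standard R installs

# Synthetic stand-in for hw3_data.csv (assumed columns: group, x, y).
# Group 1 lies inside a circle, so the true boundary is non-linear.
set.seed(445)
n <- 1000
dat <- data.frame(x = runif(n, -2, 2), y = runif(n, -2, 2))
dat$group <- as.integer(dat$x^2 + dat$y^2 < 1.5)

# Item 2: random split into 600 training and 400 test rows.
train_idx <- sample(n, 600)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Item 3 (two of the four methods): logistic regression and k-NN.
fit_lr  <- glm(group ~ x + y, data = train, family = binomial)
pred_lr <- as.integer(predict(fit_lr, newdata = test, type = "response") > 0.5)
pred_knn <- knn(train[, c("x", "y")], test[, c("x", "y")],
                cl = factor(train$group), k = 5)   # k = 5 is an arbitrary choice
pred_knn <- as.integer(as.character(pred_knn))

# Fraction of test points classified correctly by each method.
acc <- c(logistic = mean(pred_lr == test$group),
         knn      = mean(pred_knn == test$group))
print(acc)

# Item 4: draw the four series on one scatter plot (hypothetical helper).
plot_series <- function(test, pred, title) {
  correct <- pred == test$group
  # 'o' vs 'x' distinguishes the groups; black vs red marks correct vs wrong.
  plot(test$x, test$y,
       pch = ifelse(test$group == 1, 4, 1),
       col = ifelse(correct, "black", "red"),
       xlab = "x", ylab = "y", main = title)
  legend("topright", cex = 0.7,
         legend = c("Group 0 correct", "Group 0 misclassified",
                    "Group 1 correct", "Group 1 misclassified"),
         pch = c(1, 1, 4, 4), col = c("black", "red", "black", "red"))
  # Return the counts behind the plot without printing them.
  invisible(table(group = test$group, correct = correct))
}
plot_series(test, pred_lr,  "Logistic regression")
plot_series(test, pred_knn, "k-nearest neighbors (k = 5)")
```

Drawing the indices with sample() gives every observation the same chance of landing in the training set, which is one common way to avoid a split that depends on the order of rows in the file. Fixing the seed only makes the split reproducible.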