CPSC445/545 Introduction to Data Mining, Spring 2008
Homework 3 (Due: Thursday, March 6, 2008)

This exercise can be done with the information provided in most of the online R tutorials listed at http://www.r-project.org and covered in the second lecture, given by Jiang Du. If you run into unexpected difficulties, it is fine to consult one of your classmates and/or Jiang. Please send your solutions to Jiang ([email protected]) by March 6, 2008.

The file hw3_data.csv contains 1000 observations in two groups (Group 0 and Group 1), each with two variables (x and y). You can load the data using the "read.csv" command in R.

1. Plot all the data points in a 2-dimensional scatter plot. Mark Group 1 points and Group 0 points differently (e.g., one with an 'x' and the other with an 'o') so that you can visualize the distribution of the points of each group.

2. Partition the data into training and classification sets with 600 and 400 observations, respectively. How is this best done in an unbiased way?

3. Using R or Weka, compare the performance of:
   o Logistic Regression
   o Support Vector Machine (SVM)
   o Neural Nets
   o K-Nearest Neighbor Classifiers
   How would you rank these methods, and why? Remember that logistic regression and SVM (with a linear kernel) are linear classifiers, i.e., they separate points of different classes with a hyperplane (a straight line in this two-variable case). In contrast, neural networks and k-nearest neighbors allow non-linear decision boundaries. (Do you have an intuitive idea of the geometry of how the latter two classify points?)

4. For each method, draw a scatter plot for the best classifier. On each plot, display the following series:
   o Group 0 points that are classified correctly,
   o Group 0 points that are misclassified,
   o Group 1 points that are classified correctly,
   o Group 1 points that are misclassified.
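As a rough illustration of the mechanics behind steps 1, 2, and 4 of the exercise, the following base-R sketch may help. It is not a model solution. Because hw3_data.csv is not reproduced here, the sketch generates a synthetic stand-in; the column names `group`, `x`, and `y`, the seed, and the 0.5 probability cutoff are all assumptions made for illustration. Only logistic regression is shown, since `glm()` ships with base R; the other three methods need additional packages.

```r
# Synthetic stand-in for hw3_data.csv (the real file is not available here).
# The column names `group`, `x`, `y` are assumptions; adjust to the real header.
set.seed(445)                       # arbitrary seed, only for reproducibility
n <- 1000
group <- rbinom(n, 1, 0.5)
dat <- data.frame(
  group = group,
  x = rnorm(n, mean = ifelse(group == 1, 1, -1)),
  y = rnorm(n, mean = ifelse(group == 1, 1, -1))
)
# With the real data one would instead use: dat <- read.csv("hw3_data.csv")

# Step 1: scatter plot, Group 0 drawn as 'o' (pch = 1), Group 1 as 'x' (pch = 4).
plot(dat$x, dat$y, pch = ifelse(dat$group == 1, 4, 1),
     xlab = "x", ylab = "y", main = "Group 0 (o) vs. Group 1 (x)")

# Step 2: random partition; 600 rows drawn without replacement for training,
# the remaining 400 held out for classification.
train_idx <- sample(nrow(dat), 600)
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

# Step 3 (one method only): logistic regression fitted on the training set.
fit  <- glm(group ~ x + y, data = train, family = binomial)
prob <- predict(fit, newdata = test, type = "response")
pred <- as.integer(prob > 0.5)      # assumed cutoff of 0.5

# Step 4: cross-tabulate the held-out points into the four requested series.
status <- factor(ifelse(pred == test$group, "correct", "misclassified"),
                 levels = c("correct", "misclassified"))
series <- table(group = test$group, status = status)
accuracy <- mean(pred == test$group)
print(series)
print(accuracy)
```

Drawing the training rows with `sample()` rather than taking the first 600 rows avoids any bias from the order in which the file happens to be stored. The four cells of `series` correspond to the four point series requested in step 4, and the same `status` vector could be used to choose plotting symbols or colors.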
