Data Mining

Problem 5: k-Nearest Neighbors Data Mining for Finding Undecided Voters for Campaign Organizers

For k = 1, why is the overall error rate equal to 0 percent on the training set? Why isn't the overall error rate equal to 0 percent on the validation set?

The error rate for the training set is calculated by comparing each training set observation to its k most similar observations in the training set. At k = 1, the single most similar observation to any training set observation is that observation itself, so every prediction matches the true class and the training error rate is 0 percent. The overall error rate on the validation set is not 0 percent because validation set observations are compared to the k most similar observations in the training set, not to themselves. At k = 1, a validation set observation may well have a different classification from its single most similar training set observation.

For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the validation data? Explain the difference in the overall error rate on the training, validation, and test data.

The value of k that minimizes the overall error rate on the validation data is k = 3. The overall error rate is lowest on the training set because each training observation's neighborhood includes itself, which biases the error rate downward. The overall error rate on the validation data is also optimistically low, because k = 3 was chosen precisely to minimize the error rate on the validation set. Applying k = 3 to the test data produces a higher error rate, since the test data played no role in selecting the best value of k and therefore gives an unbiased estimate of performance.

Examine the decile-wise lift chart. What is the first decile lift on the test data? Interpret this value.

The first decile lift on the test data is 2.29. The test data set contains 2,000 observations, 791 of which are undecided voters. If we randomly selected 200 voters (the first decile), we would expect, on average, 79.1 of them to be undecided (200 × 791/2,000).
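The first answer above can be sketched in a few lines of code. This is a minimal pure-Python illustration with a hypothetical two-feature data set (the values are made up, not taken from the assignment's voter data); it shows why the training error rate is 0 percent at k = 1 when each observation is allowed to be its own nearest neighbor:

```python
import math

# Hypothetical training set: (feature1, feature2, class_label)
train = [(1.0, 2.0, 0), (2.0, 1.0, 0), (8.0, 9.0, 1), (9.0, 8.0, 1)]

def knn_predict(x, y, data, k):
    """Classify point (x, y) by majority vote of its k nearest neighbors in data."""
    neighbors = sorted(data, key=lambda p: math.dist((x, y), p[:2]))[:k]
    labels = [p[2] for p in neighbors]
    return max(set(labels), key=labels.count)

# Score the training set against itself: at k = 1, every observation's
# nearest neighbor (at distance 0) is the observation itself, so the
# predicted class always equals the true class.
errors = sum(knn_predict(x, y, train, k=1) != label for x, y, label in train)
print(f"training error rate at k = 1: {errors / len(train):.0%}")  # → 0%
```

Scoring a separate validation point against this training set has no such guarantee, which is why the validation error rate is generally above zero.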
After using k-nearest neighbors with k = 3 to identify the 200 voters most likely to be undecided, about 181.14 of them would actually be undecided (calculated as 79.1 × 2.29). In other words, the model finds roughly 2.29 times as many undecided voters in the first decile as random selection would.

For cutoff probability values of 0.5, 0.4, 0.3, and 0.2, what are the corresponding Class 1 error rates and Class 0 error rates on the validation data?

Cutoff Value         0.5      0.4      0.3      0.2
Class 1 Error Rate   25.72%   25.72%   8.68%    8.68%
Class 0 Error Rate   17.36%   17.36%   36.80%   36.80%

As the cutoff decreases, more observations are classified as Class 1, so the Class 1 error rate falls while the Class 0 error rate rises.

Problem 6: Logistic Regression to Predict the Oscars

What is the resulting logistic regression calculation?

The resulting logistic regression calculation to help predict winners of the Best Picture Oscar is:

Log Odds of Winning Best Picture = -8.21 + (0.57 × OscarNominations) + (1.03 × GoldenGlobeWins)

What is the overall error rate on the validation data?

The overall error rate on the validation data is 15 percent, with 6 errors out of 40 cases.

Use the model to score the new data (2011). Which movie did the model select as the most likely to win the 2011 Best Picture Award?

The model selected "The Artist" as the most likely winner of the 2011 Best Picture Award at the Oscars, with a win probability of 64.58%.
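The fitted equation above produces log odds, which the logistic (sigmoid) function converts into a win probability. A minimal sketch; the nomination and Golden Globe counts passed in below are illustrative inputs, not values taken from the assignment's 2011 data:

```python
import math

def win_probability(oscar_nominations, golden_globe_wins):
    """Win probability implied by the fitted model:
    log-odds = -8.21 + 0.57 * OscarNominations + 1.03 * GoldenGlobeWins."""
    log_odds = -8.21 + 0.57 * oscar_nominations + 1.03 * golden_globe_wins
    # Sigmoid maps log-odds to a probability in (0, 1)
    return 1 / (1 + math.exp(-log_odds))

# Illustrative inputs: a heavily nominated film with several Golden Globe
# wins scores far above a lightly nominated one.
print(f"{win_probability(10, 3):.4f}")  # strong contender, above 0.5
print(f"{win_probability(2, 0):.4f}")   # long shot, near 0
```

Scoring each 2011 nominee this way and taking the film with the highest probability reproduces the selection step described above.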