Data Mining
Problem 5: k-Nearest Neighbors Data Mining for Finding Undecided Voters for Campaign
Organizers
For k = 1, why is the overall error rate equal to 0 percent on the training set? Why isn't
the overall error rate equal to 0 percent on the validation set?
The error rate on the training set is computed by classifying each training observation using its
k most similar observations in the training set itself. At k = 1, the single most similar
observation to any training observation is that observation itself (at distance zero), so every
prediction matches its own label and the training error rate is 0 percent. The overall error rate
on the validation set is not 0 percent because validation observations are compared to their
nearest neighbors in the training set, not to themselves; at k = 1, a validation observation may
well have a different class than its single nearest training neighbor.
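The self-match argument above can be sketched directly; this is a minimal 1-nearest-neighbor example on hypothetical one-dimensional data, not the textbook's dataset:

```python
# Minimal 1-NN sketch (hypothetical 1-D data) showing why training error is
# 0% at k = 1: each training point's nearest training neighbor is itself.

def nearest_neighbor_label(query, points, labels):
    """Return the label of the training point closest to `query`."""
    best = min(range(len(points)), key=lambda i: abs(points[i] - query))
    return labels[best]

train_x = [1.0, 2.0, 5.0, 7.0]
train_y = ["undecided", "decided", "decided", "undecided"]

# Scoring the training set against itself: every point is its own nearest
# neighbor (distance zero), so every prediction matches its own label.
errors = sum(nearest_neighbor_label(x, train_x, train_y) != y
             for x, y in zip(train_x, train_y))
print(errors)  # 0 -> 0% training error at k = 1
```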
For the cutoff probability value 0.5, what value of k minimizes the overall error rate on the
validation data? Explain the difference in the overall error rate on the training, validation,
and test data.
The value of k that minimizes the overall error rate on the validation data is k = 3. The error
rate is lowest on the training set because each training observation is included among its own
nearest neighbors, which deflates the error. The overall error rate on the validation data is
optimistically biased as well, since k = 3 was chosen precisely because it produced the lowest
validation error across all candidate values of k. Applying k = 3 to the test data therefore
yields a higher error rate: the test data played no role in selecting k, so it provides an
unbiased estimate of performance on new data.
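The k-selection procedure can be sketched as a loop over candidate values that keeps the k with the fewest validation errors; the data and candidate k values below are hypothetical illustrations, not the problem's dataset:

```python
# Sketch of validation-based k selection (hypothetical 1-D data). Because the
# chosen k is the one that minimizes validation error, the validation error at
# that k is optimistically biased; an untouched test set gives an honest estimate.
from collections import Counter

def knn_predict(query, train, k):
    """Majority vote among the k training points nearest to `query`."""
    neighbors = sorted(train, key=lambda p: abs(p[0] - query))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]

train = [(1.0, 0), (2.0, 0), (3.5, 0), (3.9, 1), (6.0, 1), (7.0, 1)]
valid = [(1.5, 0), (3.8, 0), (6.5, 1)]

# Pick the candidate k with the fewest validation errors.
best_k = min(
    (1, 3, 5),
    key=lambda k: sum(knn_predict(x, train, k) != y for x, y in valid),
)
print(best_k)
```

Here k = 1 misclassifies the validation point at 3.8 (its single nearest neighbor is the noisy training point at 3.9), while k = 3 smooths the noise away.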
Examine the decile-wise lift chart. What is the first decile lift on the test data? Interpret
this value.
The first decile lift on the test data is 2.29. The test data set contains 2,000 observations, of
which 791 are undecided voters. If we randomly selected 200 voters, we would expect, on average,
791/2000 × 200 = 79.1 of them to be undecided. Using k-nearest neighbors with k = 3 to identify
the 200 voters most likely to be undecided, we would instead expect about 79.1 × 2.29 ≈ 181.14 of
them to actually be undecided.
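The lift arithmetic above is just two multiplications; all figures come from the text:

```python
# First-decile lift arithmetic: with 791 undecided voters among 2000 test
# observations, a random 200-voter sample contains 791/2000 * 200 = 79.1
# undecided voters on average; a lift of 2.29 means the model's top decile
# captures about 2.29 times that many.

total, undecided, decile_size, lift = 2000, 791, 200, 2.29

random_baseline = undecided / total * decile_size   # expected by chance
model_captured = random_baseline * lift             # expected in top decile

print(round(random_baseline, 2), round(model_captured, 2))  # 79.1 181.14
```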
For cutoff probability values of 0.5, 0.4, 0.3, and 0.2, what are the corresponding Class 1
error rates and Class 0 error rates on the validation data?
Cutoff Value    Class 1 Error Rate    Class 0 Error Rate
0.5             25.72%                17.36%
0.4             25.72%                17.36%
0.3             8.68%                 36.80%
0.2             8.68%                 36.80%
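The pattern in the table, where lowering the cutoff reduces Class 1 errors while raising Class 0 errors, can be sketched with a small scoring function; the predicted probabilities below are hypothetical, not the problem's output:

```python
# Sketch of the cutoff trade-off (hypothetical predicted probabilities):
# lowering the cutoff classifies more cases as Class 1, which cuts Class 1
# errors (missed 1s) but raises Class 0 errors (0s flagged as 1s).

def error_rates(scored, cutoff):
    """Return (Class 1 error rate, Class 0 error rate) at a given cutoff."""
    c1 = [p for p, y in scored if y == 1]
    c0 = [p for p, y in scored if y == 0]
    class1_err = sum(p < cutoff for p in c1) / len(c1)    # 1s missed
    class0_err = sum(p >= cutoff for p in c0) / len(c0)   # 0s flagged
    return class1_err, class0_err

scored = [(0.9, 1), (0.6, 1), (0.35, 1), (0.45, 0), (0.25, 0), (0.1, 0)]

strict = error_rates(scored, 0.5)   # fewer cases called Class 1
loose = error_rates(scored, 0.3)    # more cases called Class 1
print(strict, loose)
```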
Problem 6: Logistic Regression to Predict the Oscars
What is the resulting logistic regression calculation?
The resulting logistic regression calculation to help predict winners of the Best Picture Oscar is:
Log Odds of Winning Best Picture = -8.21 + (0.57 * OscarNominations) + (1.03 *
GoldenGlobeWins)
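The fitted equation can be turned into a win probability with the logistic transform; the coefficients are from the text, while the nomination and win counts in the example are hypothetical illustration values:

```python
# Scoring sketch using the fitted coefficients from the text. Log odds are
# converted to a probability with the logistic function 1 / (1 + e^(-z)).
import math

def win_probability(oscar_nominations, golden_globe_wins):
    log_odds = -8.21 + 0.57 * oscar_nominations + 1.03 * golden_globe_wins
    return 1 / (1 + math.exp(-log_odds))

# Hypothetical film with 10 Oscar nominations and 3 Golden Globe wins:
p = win_probability(10, 3)
print(round(p, 4))
```

Each additional Oscar nomination adds 0.57 to the log odds and each Golden Globe win adds 1.03, so both always push the predicted probability upward.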
What is the overall error rate on the validation data?
The overall error rate on the validation data is 15%, with 6 errors out of 40 cases.
Use the model to score the new data (2011). Which movie did the model select as the most
likely to win the 2011 Best Picture Award?
The model selected “The Artist” as the most likely winner of the 2011 Best Picture Award at the
Oscars, with a win probability of 64.58%.