CS 490 Test 2, Fall of 2012

Name:___________________________________________ Score:_____

40 questions, 2 pts. apiece. Score will be calculated as 100 – number of points off.

Ch. 5, Credibility: Evaluating What's Been Learned

1. Consider a data mining scheme that does prediction/classification. Let a training set and a test set be given. Clearly describe what the purpose of the test set is. Do not explain how it's used. Just state what useful end result you get from it.

2. Consider a classification problem with a training set and a test set. Explain what it means in this context for the test set to be stratified.

3. Let a classification problem be given with 10 partitions. 10-fold cross-validation is an iterative technique used in performance evaluation.
A. What is it that you do 10 times?
B. What is the final statistic that you calculate based on the 10 iterations?

4. There is a statistical test called the paired t-test that can be used in data mining performance evaluation. State clearly what it is used for.

5. The sum shown below represents the quadratic loss function for a single instance, where the data mining algorithm gives a probability, p_j, of the jth classification, and a_j is 1 if the jth prediction is the correct one and 0 otherwise. In total, the quadratic loss function, or mean square error, would be a double sum averaged over all of the instances. What is the goal with respect to this function when searching for the best data mining scheme?

Σ_j (p_j – a_j)²

6. The expression for the informational loss function is shown below. The value p_i is the probability given by the classification scheme for the ith class, which is the correct one. For a data mining algorithm to get a good performance evaluation:

–log₂ p_i

A. p_i should be:
B. –log₂ p_i should be:

7. Compare the quadratic loss function and the informational loss function with respect to the data they are based on.
A. What do they have in common?
B. In what way do they differ?

8. In the direct mailing example, which kind of failure had a higher individual cost, a false positive or a false negative?

9. The book gave something known as the Kappa statistic for evaluating the performance of a classifier. Two tables are shown below which contain the information needed to compute the Kappa statistic. You do not have to answer with the Kappa statistic itself, although that would be a correct answer. Using the data in the tables, make a clear statement which quantifies the performance of the results of the predictor of interest in comparison with the hypothetical one.

Predictor in Question
              pred. a   pred. b   actual
actual a         90        10       100
actual b         10        90       100
pred.           100       100      (200)

Hypothetical Predictor
              pred. a   pred. b   actual
actual a         50        50       100
actual b         50        50       100
pred.           100       100      (200)

10. Statistical measures might show one data mining algorithm to give worse results than another. In practice, why might the one with worse results be used?

11. Let a system that gives probabilities, p, of classification be considered. Assume that there are known costs to misclassifications, c. Up until now, the underlying assumption has been that if a straight classification was desired, the classification with the greatest value of p was chosen. Under a cost-based alternative scheme for choosing, you could choose the classification where a certain arithmetic expression was smallest. Give that arithmetic expression.

12. Consider a binary classification problem. There is a technique for improving a classifier by taking the cost of a misclassification into account. In this technique, bias is introduced into the training set. Explain concretely how this would be done in a case where false positives were more expensive than false negatives.

13. Suppose you are using a data mining algorithm that gives probabilities of yes or no in a binary situation. Suppose you order the training set instances, which include the actual classification, from highest to lowest predicted probability of yes. Restrict your attention to the top half of the list. Define the term "lift" in this scenario.

14. A lift chart for a direct mailing-like application is shown below.
A. What is the optimal point on the curve?
B. Explain why:

15. A lift chart and a cost-benefit curve for a direct mailing-like application are shown below.
A. Is the cost curve weighed down on the left by false positives or false negatives?
B. Is the cost curve weighed down on the right by false positives or false negatives?

16. A graph showing 2 ROC (receiver operating characteristic) curves is shown below. Make a statement about making the choice between algorithm A and algorithm B. A correct answer will not be stated in terms of a decision point based on False Positives (%). A correct answer will be a more general statement about a decision point on a factor that is related to the x-axis based on the definition of the ROC curve. Put your answer to the right of the figure.

17. Something roughly analogous to a lift chart and an ROC curve is known as a recall-precision curve. What problem domain is this used in?

18. An error curve for a binary classification algorithm A is shown below. The error curve is linear. Make a simple correct statement about the performance of this algorithm.

19. A graph showing cost curves for binary classification algorithms A and B is shown below. The graph allows you to choose between A and B in certain situations. In order to use the graph in this way, you need an x-axis value. In practice, what do you have to do in order to obtain this value?

20. State Occam's Razor in your own words.

Ch. 6, Implementations: Real Machine Learning Schemes

21. Two diagrams of pruning, (i) and (ii), are shown below.
A. Which diagram illustrates subtree raising?
B. Which diagram illustrates subtree replacement?

22. Consider pruning again.
A. Name the data set used to accomplish pruning.
B. What is the basis for deciding whether to prune?

23. Consider the measures of success for rule sets. Let p = positive, the number of instances correctly classified by a rule; let t = total, the total number of instances covered by the rule.
A. Give a metric in the form of an arithmetic expression involving p and t which, if maximized, would tend to select rules on the basis of degree of perfection only.
B. If P and T represent the case before a rule is added and p and t represent the case after the rule is added, this is a possible metric for rule selection: p × (log(p/t) – log(P/T)). In addition to degree of perfection, what other rule characteristic does this metric take into account when selecting rules?

24. Theoretically, there is no perfect balance point between the two alternatives of the previous question. However, in practice it is the case that imperfect rules tend to be better classifiers. What is the empirical reason for this?

25. State what an FP-tree is for.

26. Given two different clusters, define a maximum margin hyperplane.

27. Given two different clusters and a maximum margin hyperplane, define a support vector.

28. This is a simple linear model: x = w₁a₁ + w₂a₂. It can be replaced by a more complicated linear model: x = w₁a₁³ + w₂a₁²a₂ + w₃a₁a₂² + w₄a₂³.
A. In the more complicated model, what benefit is gained from the higher powers on the a's?
B. In the more complicated model, in short, what is the computational cost which is exemplified by the w's?

29. Give any one desirable characteristic of a neural network (multi-layer perceptron) in a data mining application.

30. Let a radial basis function approach in two dimensions be represented by a slice of Swiss cheese. What do the holes represent?

31. In an instance-based learning scheme you become concerned with the concept of exemplars.
A. What general benefit would you expect to get from retaining many exemplars?
B. What general shortcoming stems from retaining many exemplars?

32. The two question subparts below together encapsulate a simple scheme for retaining and discarding exemplars. The scheme may seem counter-intuitive at first glance, but there is a logic to why it works, assuming that the data are not noisy.
A. Any new instance that arrives and is correctly classified by the existing set of exemplars is not added to the set of exemplars. Why?
B. Any new instance that arrives and is not correctly classified by the existing set of exemplars is added to the exemplars. Why?

33. Let a model tree be under consideration.
A. What does the process of data mining derive for each leaf of the tree?
B. How do you make a numeric prediction using a model tree?

34. This is a simple smoothing equation for a model tree: p' = (np + kq) / (n + k).
A. What is n?
B. What is k?

35. In a Bayesian network, graphically represented, in terms of the attributes represented by 2 nodes, what statistical relationship does the arc between the nodes represent?

36. What is the problem with k-means clustering?

37. Consider the agglomerative clustering approach.
A. Define single linkage.
B. Define complete linkage.

38. In your own words, describe what all semi-supervised data mining algorithms have in common. What means do they use in order to try and come up with better prediction schemes?

39. Consider the chemical molecule/drug example.
A. What were the multiple instances that belonged together (what was in the bag) for a given chemical?
B. Under what condition was a molecule defined to be active?

40. This is your chance to sound off on any topic you would like to.
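As a study aid, the 10-fold cross-validation procedure asked about in question 3 can be sketched in a few lines of Python. The majority-class "learner" below is an invented stand-in for a real data mining scheme, not anything from the exam or the book:

```python
# Sketch of k-fold cross-validation (question 3), assuming a toy "learner"
# that simply predicts the majority class seen in its training folds.
from collections import Counter

def k_fold_accuracy(labels, k=10):
    """Do k times: hold out one fold, fit on the rest, score the held-out fold.
    The final statistic is the average of the k per-fold accuracies."""
    n = len(labels)
    per_fold = []
    for i in range(k):
        held_out = set(range(i, n, k))      # fold i: every kth instance
        train = [labels[j] for j in range(n) if j not in held_out]
        majority = Counter(train).most_common(1)[0][0]   # the "training" step
        test = [labels[j] for j in held_out]
        per_fold.append(sum(y == majority for y in test) / len(test))
    return sum(per_fold) / k                # averaged over the k iterations

labels = ["yes"] * 70 + ["no"] * 30         # invented toy data
print(round(k_fold_accuracy(labels), 2))    # each fold scores 7 of 10 correct
```

Note that with this interleaved split each fold happens to contain 7 "yes" and 3 "no" instances, i.e. the folds preserve the 70/30 class proportions of the full set, which is the stratification idea of question 2.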
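The two loss functions of questions 5 and 6 can be evaluated directly from their definitions; the three-class probability vector below is an invented example, not data from the exam:

```python
# Quadratic and informational loss for a single instance (questions 5 and 6).
import math

def quadratic_loss(probs, true_class):
    # a_j is 1 for the correct class and 0 otherwise, per question 5.
    return sum((p - (1 if j == true_class else 0)) ** 2
               for j, p in enumerate(probs))

def informational_loss(probs, true_class):
    # -log2 of the probability assigned to the correct class, per question 6.
    return -math.log2(probs[true_class])

probs = [0.5, 0.25, 0.25]            # scheme's probabilities for three classes
print(quadratic_loss(probs, 0))      # (0.5-1)^2 + 0.25^2 + 0.25^2 = 0.375
print(informational_loss(probs, 0))  # -log2(0.5) = 1.0
```

Both functions reach 0 when the scheme puts probability 1 on the correct class, so smaller values indicate a better-calibrated scheme.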
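The tables in question 9 contain exactly the counts needed for the Kappa statistic. A minimal sketch, assuming the standard formula (the excess of the predictor's successes over the chance predictor's successes, as a fraction of the maximum possible excess):

```python
# Kappa statistic from question 9's tables.
def kappa(observed_correct, chance_correct, total):
    """Fraction of the possible improvement over chance that is realized."""
    return (observed_correct - chance_correct) / (total - chance_correct)

# Predictor in question: 90 + 90 = 180 correct out of 200 (table diagonal).
# Hypothetical predictor: 50 + 50 = 100 correct out of 200.
print(kappa(180, 100, 200))   # 0.8
```

A Kappa of 0 means no better than the hypothetical chance predictor; 1 means perfect agreement with the actual classes.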
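The smoothing equation in question 34 can also be exercised numerically; the values plugged in below are invented for illustration:

```python
# Model-tree smoothing, p' = (np + kq) / (n + k), from question 34.
def smooth(p, q, n, k):
    """Blend the prediction p produced at a node with the prediction q of the
    node above it, weighting p by n and q by the smoothing constant k."""
    return (n * p + k * q) / (n + k)

print(smooth(10.0, 20.0, 6, 2))   # (6*10 + 2*20) / 8 = 12.5
```

With a large n the blended value p' stays close to p, so smoothing matters most at leaves reached by few training instances.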