CS 490 Test 2, Fall of 2012
Name:___________________________________________
Score:_____
40 questions, 2 pts. apiece. Score will be calculated as 100 – number of points off.
Ch. 5, Credibility: Evaluating What's Been Learned
1. Consider a data mining scheme that does prediction/classification. Let a training set and a
test set be given. Clearly describe what the purpose of the test set is. Do not explain how it's
used. Just state what useful end result you get from it.
2. Consider a classification problem with a training set and a test set. Explain what it means in
this context for the test set to be stratified.
3. Let a classification problem be given in which the data set is partitioned into 10 subsets.
10-fold cross-validation is an iterative technique used in performance evaluation.
A. What is it that you do 10 times?
B. What is the final statistic that you calculate based on the 10 iterations?
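A minimal Python sketch of the 10-fold procedure in question 3 (the toy data set and the majority-class learner below are invented for illustration; the final statistic is the error rate averaged over the 10 folds):

```python
import random

def cross_validate(instances, labels, train_fn, k=10, seed=0):
    """Shuffle the data, split it into k folds, and for each fold
    train on the other k-1 folds and test on the held-out fold."""
    idx = list(range(len(instances)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    errors = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in idx if i not in held_out]
        predict = train_fn([instances[i] for i in train_idx],
                           [labels[i] for i in train_idx])
        wrong = sum(predict(instances[i]) != labels[i] for i in test_idx)
        errors.append(wrong / len(test_idx))
    return sum(errors) / k  # final statistic: mean error rate

# Toy learner: always predict the majority class of its training labels.
def majority_learner(xs, ys):
    majority = max(set(ys), key=ys.count)
    return lambda x: majority

data = list(range(100))
labels = ["yes"] * 70 + ["no"] * 30
err = cross_validate(data, labels, majority_learner)
```

With a 70/30 class split, the majority learner always predicts "yes", so the averaged error rate works out to the overall "no" fraction, 0.3.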
4. There is a statistical test called the paired t-test that can be used in data mining performance
evaluation. State clearly what it is used for.
5. The sum shown below represents the quadratic loss function for a single instance, where the
data mining algorithm gives a probability, pj, for the jth classification, and aj is 1 if the
jth class is the correct one and 0 otherwise. In total, the quadratic loss function, or mean
square error, would be a double sum averaged over all of the instances. What is the goal with
respect to this function when searching for the best data mining scheme?
Σj (pj – aj)²
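The single-instance quadratic loss can be computed directly as a sketch; the probability vectors below are made up for illustration:

```python
def quadratic_loss(probs, actual_index):
    """Sum over classes j of (p_j - a_j)^2, where a_j is 1 for the
    actual class and 0 otherwise (loss for a single instance)."""
    return sum((p - (1.0 if j == actual_index else 0.0)) ** 2
               for j, p in enumerate(probs))

# A confident correct prediction gives near-zero loss; a confident
# wrong one is heavily penalized.
good = quadratic_loss([0.9, 0.1], 0)   # (0.9-1)^2 + (0.1-0)^2 = 0.02
bad = quadratic_loss([0.1, 0.9], 0)    # (0.1-1)^2 + (0.9-0)^2 = 1.62
```

The goal, minimization, falls out of the comparison: the better-calibrated prediction has the smaller loss.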
6. The expression for the informational loss function is shown below. The value pi is the
probability given by the classification scheme for the ith class, which is the correct one. For a
data mining algorithm to get a good performance evaluation:
–log2 pi
A. pi should be:
B. –log2 pi should be:
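Likewise, the informational loss is just –log2 of the probability assigned to the class that turned out to be correct; the example values are illustrative:

```python
import math

def informational_loss(p_correct):
    """-log2 of the probability assigned to the correct class;
    smaller is better, and it is 0 when p_i = 1."""
    return -math.log2(p_correct)

perfect = informational_loss(1.0)   # 0.0 bits
coin    = informational_loss(0.5)   # 1.0 bit
```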
7. Compare the quadratic loss function and the informational loss function with respect to the
data they are based on.
A. What do they have in common?
B. In what way do they differ?
8. In the direct mailing example, which kind of failure had a higher individual cost, a false
positive, or a false negative?
9. The book gave something known as the Kappa statistic for evaluating the performance of a
classifier. Two tables are shown below which contain the information needed to compute the
Kappa statistic. You do not have to answer with the Kappa statistic itself, although that would
be a correct answer. Using the data in the tables, make a clear statement which quantifies the
performance of the results of the predictor of interest in comparison with the hypothetical one.
Predictor in Question
              pred. a   pred. b   total
actual a         90        10      100
actual b         10        90      100
pred. total     100       100     (200)

Hypothetical Predictor
              pred. a   pred. b   total
actual a         50        50      100
actual b         50        50      100
pred. total     100       100     (200)
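For reference, the Kappa statistic itself can be computed from the diagonal counts of the two tables as a sketch, using the formula (observed successes – expected successes) / (total – expected successes):

```python
def kappa(observed, expected, total):
    """Kappa = (observed successes - expected successes) /
               (total - expected successes)."""
    return (observed - expected) / (total - expected)

# Diagonal counts from the two confusion matrices above:
observed = 90 + 90   # predictor in question: 180 of 200 correct
expected = 50 + 50   # hypothetical (chance) predictor: 100 of 200
k = kappa(observed, expected, 200)
```

Here Kappa is 0.8: the predictor of interest achieves 80% of the possible improvement over the chance-level hypothetical predictor.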
10. Statistical measures might show one data mining algorithm to give worse results than
another. In practice, why might the one with worse results be used?
11. Let a system that gives probabilities, p, of classification be considered. Assume that there
are known costs to misclassifications, c. Up until now, the underlying assumption has been that
if a straight classification was desired, the classification with the greatest value of p was chosen.
Under a cost-based alternative scheme for choosing, you could choose the classification where a
certain arithmetic expression was smallest. Give that arithmetic expression.
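One hedged sketch of the cost-based rule: predict the class k that minimizes the expected cost Σj pj·cjk, where cjk is the cost of predicting k when class j is actually true. The cost matrix below is hypothetical:

```python
def min_expected_cost_class(probs, cost):
    """Pick the class k minimizing sum_j p_j * cost[j][k], i.e. the
    expected cost of predicting k given class distribution p."""
    n = len(probs)
    expected = [sum(probs[j] * cost[j][k] for j in range(n))
                for k in range(n)]
    return min(range(n), key=lambda k: expected[k])

# Hypothetical 2-class costs: misclassifying class 0 as 1 costs 10,
# class 1 as 0 costs 1; correct predictions cost nothing.
cost = [[0, 10],
        [1, 0]]
choice = min_expected_cost_class([0.4, 0.6], cost)
```

Note the contrast with the greatest-p rule: class 1 has the higher probability (0.6), yet class 0 is chosen because predicting 1 carries the larger expected cost.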
12. Consider a binary classification problem. There is a technique for improving a classifier by
taking the cost of a misclassification into account. In this technique, bias is introduced into the
training set. Explain concretely how this would be done in a case where false positives were
more expensive than false negatives.
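A sketch of one concrete way to introduce that bias, assuming "yes"/"no" labels and duplication of instances (the duplication factor here is arbitrary):

```python
import random

def bias_training_set(instances, dup_factor=3, seed=0):
    """If false positives (predicting yes when actual is no) are more
    expensive, duplicate the 'no' instances so the learner sees more
    of them and becomes more reluctant to predict yes."""
    biased = list(instances)
    for x, label in instances:
        if label == "no":
            biased.extend([(x, label)] * (dup_factor - 1))
    random.Random(seed).shuffle(biased)
    return biased

data = [(1, "yes"), (2, "no"), (3, "yes")]
biased = bias_training_set(data)
```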
13. Suppose you are using a data mining algorithm that gives probabilities of yes or no in a
binary situation. Suppose you order the training set instances, which include the actual
classification, from highest to lowest predicted probability of yes. Restrict your attention to the
top half of the list. Define the term "lift" in this scenario.
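Lift for the top half of the ranked list can be sketched as follows; the probabilities and labels are toy values:

```python
def lift(instances, fraction=0.5):
    """instances: list of (predicted_prob_yes, actual) pairs.
    Lift of the top fraction of the list, ordered by predicted
    probability: (yes-rate in that subset) / (overall yes-rate)."""
    ranked = sorted(instances, key=lambda t: t[0], reverse=True)
    top = ranked[:int(len(ranked) * fraction)]
    rate_top = sum(a == "yes" for _, a in top) / len(top)
    rate_all = sum(a == "yes" for _, a in instances) / len(instances)
    return rate_top / rate_all

# Toy data: a well-ranked list concentrates the yeses near the top.
data = [(0.9, "yes"), (0.8, "yes"), (0.7, "yes"), (0.6, "no"),
        (0.4, "yes"), (0.3, "no"), (0.2, "no"), (0.1, "no")]
l = lift(data)
```

With 3 of the top 4 instances being "yes" against an overall yes-rate of 0.5, the lift is 1.5.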
14. A lift chart for a direct mailing-like application is shown below.
A. What is the optimal point on the curve?
B. Explain why:
15. A lift chart and a cost-benefit curve for a direct mailing-like application are shown below.
A. Is the cost curve weighed down on the left by false positives or false negatives?
B. Is the cost curve weighed down on the right by false positives or false negatives?
16. A graph showing 2 ROC (receiver operating characteristic) curves is shown below. Make a
statement about making the choice between algorithm A and algorithm B. A correct answer will
not be stated in terms of a decision point based on False Positives (%). A correct answer will be
a more general statement about a decision point on a factor that is related to the x-axis based on
the definition of the ROC curve. Put your answer to the right of the figure.
17. Something roughly analogous to a lift chart and an ROC curve is known as a recall-precision
curve. What problem domain is this used in?
18. An error curve for a binary classification algorithm A is shown below. The error curve is
linear. Make a simple correct statement about the performance of this algorithm.
19. A graph showing cost curves for binary classification algorithms A and B is shown below.
The graph allows you to choose between A and B in certain situations. In order to use the graph
in this way, you need an x-axis value. In practice, what do you have to do in order to obtain this
value?
20. State Occam's Razor in your own words.
Ch. 6, Implementations: Real Machine Learning Schemes
21. Two diagrams of pruning are shown below.
A. Which diagram illustrates subtree raising?
B. Which diagram illustrates subtree replacement?
[Diagrams (i) and (ii) appear here but are not reproduced.]
22. Consider pruning again.
A. Name the data set used to accomplish pruning.
B. What is the basis for deciding whether to prune?
23. Consider the measures of success for rule sets. Let p = positive, the number of instances
correctly classified by a rule; let t = total, the total number of instances covered by the rule.
A. Give a metric in the form of an arithmetic expression involving p and t which, if maximized,
would tend to select rules on the basis of degree of perfection only.
B. If P and T represent the case before a rule is added and p and t represent the case after the
rule is added, this is a possible metric for rule selection: p * (log p/t – log P/T). In addition to
degree of perfection, what other rule characteristic does this metric take into account when
selecting rules?
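Both metrics from question 23 can be compared on made-up counts (the P, T, p, and t values below are hypothetical):

```python
import math

def accuracy_metric(p, t):
    """p/t: fraction of covered instances the rule gets right;
    maximizing this selects for degree of perfection only."""
    return p / t

def info_gain_metric(p, t, P, T):
    """p * (log2(p/t) - log2(P/T)): the leading factor p also
    rewards coverage, weighting the improvement in success rate
    by the number of instances the rule covers."""
    return p * (math.log2(p / t) - math.log2(P / T))

# A perfect but tiny rule vs. a slightly imperfect but broad one
# (P = 100 positives among T = 200 instances before either rule):
tiny  = info_gain_metric(2, 2, 100, 200)
broad = info_gain_metric(45, 50, 100, 200)
```

The accuracy metric prefers the tiny perfect rule (2/2 beats 45/50), while the information-gain metric prefers the broad one, illustrating the coverage trade-off.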
24. Theoretically, there is no perfect balance point between the two alternatives of the previous
question. However, in practice it is the case that imperfect rules tend to be better classifiers.
What is the empirical reason for this?
25. State what an FP-tree is for.
26. Given two different clusters, define a maximum margin hyperplane.
27. Given two different clusters and a maximum margin hyperplane, define a support vector.
28. This is a simple linear model: x = w1a1 + w2a2. It can be replaced by a more complicated
linear model: x = w1a1³ + w2a1²a2 + w3a1a2² + w4a2³.
A. In the more complicated model, what benefit is gained from the higher powers on the a's?
B. In the more complicated model, in short, what is the computational cost which is exemplified
by the w's?
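The degree-3 terms of the more complicated model can be enumerated mechanically; this sketch assumes two attributes and cubic degree:

```python
from itertools import combinations_with_replacement

def cubic_terms(a1, a2):
    """All degree-3 products of two attributes: a1^3, a1^2*a2,
    a1*a2^2, a2^3 -- the basis of the more complicated linear model.
    The number of such terms (and hence of weights w to fit) grows
    rapidly with the number of attributes and the degree."""
    return [x * y * z
            for x, y, z in combinations_with_replacement((a1, a2), 3)]

terms = cubic_terms(2.0, 3.0)   # [8.0, 12.0, 18.0, 27.0]
```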
29. Give any one desirable characteristic of a neural network (multi-layer perceptron) in a data
mining application.
30. Let a radial basis function approach in two dimensions be represented by a slice of Swiss
cheese. What do the holes represent?
31. In an instance based learning scheme you become concerned with the concept of exemplars.
A. What general benefit would you expect to get from retaining many exemplars?
B. What general shortcoming stems from retaining many exemplars?
32. The two question subparts below together encapsulate a simple scheme for retaining and
discarding exemplars. The scheme may seem counter-intuitive at first glance, but there is a logic
to why it works assuming that the data are not noisy.
A. Any new instance that arrives and is correctly classified by the existing set of exemplars is
not added to the set of exemplars. Why?
B. Any new instance that arrives and is not correctly classified by the existing set of exemplars
is added to the exemplars. Why?
33. Let a model tree be under consideration.
A. What does the process of data mining derive for each leaf of the tree?
B. How do you make a numeric prediction using a model tree?
34. This is a simple smoothing equation for a model tree: p' = (np + kq) / (n + k).
A. What is n?
B. What is k?
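The smoothing equation can be exercised directly as a sketch; the values of p, q, n, and k below are invented:

```python
def smooth(p, q, n, k=5):
    """Model-tree smoothing: p' = (n*p + k*q) / (n + k), where p is
    the prediction passed up from the node below, q the value of the
    model at the current node, n the number of training instances
    reaching the node below, and k a smoothing constant."""
    return (n * p + k * q) / (n + k)

# With few supporting instances (small n) the node model q pulls
# hard on the result; with many it barely moves the prediction p.
few  = smooth(p=10.0, q=20.0, n=5)    # (50 + 100) / 10  = 15.0
many = smooth(p=10.0, q=20.0, n=95)   # (950 + 100) / 100 = 10.5
```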
Do not overlook the questions on the back of this page.
35. In a graphically represented Bayesian network, what statistical relationship does an arc
between 2 nodes represent, in terms of the attributes represented by those nodes?
36. What is the problem with k-means clustering?
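One well-known problem, sensitivity of the final clusters to the initial centroids, can be demonstrated with a small 1-D sketch (the points and starting centroids are contrived):

```python
def kmeans_1d(points, centroids, iters=20):
    """Plain k-means on 1-D points: assign each point to its nearest
    centroid, then move each centroid to the mean of its points.
    The final clustering depends on where the centroids start."""
    for _ in range(iters):
        clusters = [[] for _ in centroids]
        for x in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: abs(x - centroids[i]))
            clusters[nearest].append(x)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

pts = [0.0, 1.0, 10.0, 11.0, 20.0, 21.0]
good = kmeans_1d(pts, [0.5, 15.0])    # converges to [0.5, 15.5]
bad  = kmeans_1d(pts, [10.0, 11.0])   # stuck in a worse local optimum
```

Both runs converge, but to different answers: the second initialization splits one natural pair across clusters and k-means cannot recover, which is the local-minimum problem of the algorithm.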
37. Consider the agglomerative clustering approach.
A. Define single linkage.
B. Define complete linkage.
38. In your own words, describe what all semi-supervised data mining algorithms have in
common. What means do they use to try to come up with better prediction schemes?
39. Consider the chemical molecule/drug example.
A. What were the multiple instances that belonged together (what was in the bag) for a given
chemical?
B. Under what condition was a molecule defined to be active?
40. This is your chance to sound off on any topic you would like to.