CS 634 DATA MINING
QUESTION 1 [Time Series Data Mining]
(A) You are given two time series
Q = 0.2, 0.3, 0.2, 0.4, 0.1, 0.5
C = 0.3, 0.4, 0.3, 0.1, 0.3
Compute the optimal (minimum) warping path distance between Q and C by drawing the distance table and the cumulative
distance table. Specifically, fill in the values of the distance table d and the cumulative distance table γ below.
What is the optimal (minimum) warping path distance? Use the traceback procedure to find the optimal warping path.
d(i, j)  (rows indexed by C, columns by Q; fill in each cell)

            Q:  0.2   0.3   0.2   0.4   0.1   0.5
  C:  0.3
      0.4
      0.3
      0.1
      0.3

γ(i, j)  (rows indexed by C, columns by Q; fill in each cell)

            Q:  0.2   0.3   0.2   0.4   0.1   0.5
  C:  0.3
      0.4
      0.3
      0.1
      0.3
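For reference, the following Python sketch shows one way to fill the d and γ tables and run the traceback. It is illustrative only: the function name dtw is ours, and the absolute-difference local distance is an assumption (substitute the squared difference if that is the convention used in class).

from math import inf

def dtw(Q, C):
    n, m = len(Q), len(C)
    # local distance table d; absolute difference is assumed here --
    # use (Q[i] - C[j]) ** 2 instead if the course defines d that way
    d = [[abs(Q[i] - C[j]) for j in range(m)] for i in range(n)]
    # cumulative table gamma: gamma(i, j) = d(i, j) + min of the three
    # predecessor cells (left, below, diagonal)
    g = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            prev = 0.0 if i == 0 and j == 0 else min(
                g[i - 1][j] if i > 0 else inf,
                g[i][j - 1] if j > 0 else inf,
                g[i - 1][j - 1] if i > 0 and j > 0 else inf)
            g[i][j] = d[i][j] + prev
    # traceback: from the last cell, repeatedly step to the predecessor
    # with the smallest cumulative value until (0, 0) is reached
    path, i, j = [(n - 1, m - 1)], n - 1, m - 1
    while (i, j) != (0, 0):
        candidates = [(a, b) for a, b in
                      ((i - 1, j), (i, j - 1), (i - 1, j - 1))
                      if a >= 0 and b >= 0]
        i, j = min(candidates, key=lambda c: g[c[0]][c[1]])
        path.append((i, j))
    return g[n - 1][m - 1], path[::-1]

Q = [0.2, 0.3, 0.2, 0.4, 0.1, 0.5]
C = [0.3, 0.4, 0.3, 0.1, 0.3]
best, path = dtw(Q, C)
print(best, path)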
(B) Given are the following 4 time series:
Q1 = 0.2, 0.3, 0.5, 0.4, 0.4, 0.7
Q2 = 0.3, 0.4, 0.5, 0.55, 0.65, 0.7, 0.9
Q3 = 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.4
Q4 = 0.35, 0.25, 0.3, 0.4, 0.6, 0.55, 0.3
For every pair of time series, calculate the optimal warping path and its distance. Note that in this part (B) you do
not need to draw the d and γ tables; however, to get full credit, you must show all the optimal warping paths and the
distance values you computed. Then, using the agglomerative hierarchical clustering algorithm and the distance values
you computed, show step by step how the agglomerative algorithm clusters the 4 time series above into 2 clusters, and
present the final clustering result.
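A minimal sketch of how the pairwise DTW distances could feed agglomerative clustering is given below. Single linkage is an assumption, since the question does not fix a linkage criterion, and the sketch reuses the dtw function from the previous sketch.

from itertools import combinations

series = {
    "Q1": [0.2, 0.3, 0.5, 0.4, 0.4, 0.7],
    "Q2": [0.3, 0.4, 0.5, 0.55, 0.65, 0.7, 0.9],
    "Q3": [0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.4],
    "Q4": [0.35, 0.25, 0.3, 0.4, 0.6, 0.55, 0.3],
}
# pairwise DTW distances, keyed by the unordered pair of names
dist = {frozenset((a, b)): dtw(series[a], series[b])[0]
        for a, b in combinations(series, 2)}

clusters = [{name} for name in series]  # start with singletons
while len(clusters) > 2:                # stop at 2 clusters, as asked
    # single linkage: cluster distance = closest pair of members
    i, j = min(combinations(range(len(clusters)), 2),
               key=lambda p: min(dist[frozenset((a, b))]
                                 for a in clusters[p[0]]
                                 for b in clusters[p[1]]))
    clusters[i] |= clusters[j]
    del clusters[j]
print(clusters)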
QUESTION 2 [Association Rule Mining]
Consider the transaction database shown in the following table.
Transaction ID    Items Bought
T100              A, B, C, D, E, F
T200              G, B, C, D, E, F
T300              A, H, D, E
T400              A, K, M, D, F
T500              M, B, D, N, E
Let minimum support be 60% and minimum confidence be 70%.
(A) Find all frequent itemsets using the Apriori algorithm. Show, step by step, how the Apriori algorithm works.
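As an illustration, here is a minimal Apriori sketch over the table above. With 5 transactions, the 60% minimum support corresponds to an absolute count of 3; all names in the sketch are ours.

from itertools import combinations

transactions = [
    {"A", "B", "C", "D", "E", "F"},
    {"G", "B", "C", "D", "E", "F"},
    {"A", "H", "D", "E"},
    {"A", "K", "M", "D", "F"},
    {"M", "B", "D", "N", "E"},
]
minsup = 3  # 60% of 5 transactions

def support(itemset):
    # number of transactions containing every item of the itemset
    return sum(itemset <= t for t in transactions)

items = sorted({i for t in transactions for i in t})
level = [frozenset([i]) for i in items if support(frozenset([i])) >= minsup]
frequent = list(level)
k = 2
while level:
    # join step: combine frequent (k-1)-itemsets into k-item candidates
    candidates = {a | b for a in level for b in level if len(a | b) == k}
    # prune step: keep a candidate only if all its (k-1)-subsets are
    # frequent, then check its support against the threshold
    level = [c for c in candidates
             if all(frozenset(s) in set(frequent)
                    for s in combinations(c, k - 1))
             and support(c) >= minsup]
    frequent += level
    k += 1

for f in frequent:
    print(sorted(f), support(f))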
(B) List all the association rules found by the Apriori algorithm.
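A companion sketch for the rule-generation step: for each frequent itemset, a rule X → Y is kept when confidence = support(X ∪ Y) / support(X) is at least 70%. It builds on the frequent list and support function from the Apriori sketch above.

minconf = 0.7
for f in frequent:
    if len(f) < 2:
        continue  # rules need a non-empty antecedent and consequent
    for r in range(1, len(f)):
        for lhs in map(frozenset, combinations(f, r)):
            conf = support(f) / support(lhs)
            if conf >= minconf:
                print(f"{sorted(lhs)} -> {sorted(f - lhs)} "
                      f"(conf = {conf:.2f})")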
(C) Find all frequent itemsets using the FP-tree algorithm. Show the final FP-tree you constructed. Note that the FP-tree
algorithm has a pre-processing step that sorts the items in each transaction in descending order of their support values.
If two items have the same support value, they must be sorted in alphabetical (lexicographical) order within the transaction.
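The following sketch constructs (but does not mine) the FP-tree, applying exactly the pre-processing order described above: items sorted by descending support, ties broken alphabetically. It reuses transactions and minsup from the Apriori sketch; the Node class is ours.

from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count, self.children = 0, {}

counts = Counter(i for t in transactions for i in t)
frequent_items = {i for i, c in counts.items() if c >= minsup}

root = Node(None, None)
for t in transactions:
    # keep only frequent items, then sort by (support desc, item asc)
    ordered = sorted((i for i in t if i in frequent_items),
                     key=lambda i: (-counts[i], i))
    node = root
    for item in ordered:
        node = node.children.setdefault(item, Node(item, node))
        node.count += 1

def show(node, depth=0):
    # print the tree as indented "item:count" lines
    if node.item is not None:
        print("  " * depth + f"{node.item}:{node.count}")
    for child in sorted(node.children.values(), key=lambda n: n.item):
        show(child, depth + 1)

show(root)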
(D) List all the association rules found by the FP-tree algorithm.
(E) In this example, indicate whether the association rules produced by the Apriori algorithm are the same as those
produced by the FP-tree algorithm.
QUESTION 3 [Classification and Prediction]
(A) Consider the following training set, which has four attributes and two classes. Build the decision tree and
generate all the classification rules from the tree. You must show, step by step, how the information gain and related
quantities are calculated during the construction of the tree.
Outlook     Temp (°F)   Humidity (%)   Windy?   Class
sunny       75          70             true     Play
sunny       80          90             true     Don't Play
sunny       85          85             false    Don't Play
sunny       72          95             false    Don't Play
sunny       69          70             false    Play
overcast    72          90             true     Play
overcast    83          78             false    Play
overcast    64          65             true     Play
overcast    81          75             false    Play
overcast    55          85             true     Play
rain        71          80             true     Don't Play
rain        65          70             true     Don't Play
rain        75          80             false    Play
rain        68          80             false    Play
rain        70          96             false    Play
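As a worked illustration of one step, the sketch below computes the information gain of a root split on Outlook; the same pattern applies to the other attributes, and the numeric attributes (Temp, Humidity) would additionally require choosing a split threshold.

from collections import Counter
from math import log2

# (Outlook, Class) pairs taken from the table above
rows = [
    ("sunny", "Play"), ("sunny", "Don't Play"), ("sunny", "Don't Play"),
    ("sunny", "Don't Play"), ("sunny", "Play"),
    ("overcast", "Play"), ("overcast", "Play"), ("overcast", "Play"),
    ("overcast", "Play"), ("overcast", "Play"),
    ("rain", "Don't Play"), ("rain", "Don't Play"), ("rain", "Play"),
    ("rain", "Play"), ("rain", "Play"),
]

def entropy(labels):
    # H = -sum p * log2(p) over the class proportions
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

base = entropy([c for _, c in rows])
remainder = 0.0
for value in {"sunny", "overcast", "rain"}:
    subset = [c for v, c in rows if v == value]
    remainder += len(subset) / len(rows) * entropy(subset)
print("Gain(Outlook) =", base - remainder)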
(B) Consider the task of building a decision tree classifier from random data, where the attribute values are generated
randomly irrespective of the class labels. Assume the data set contains records from two classes, “+” and “−”. Half of the
data set is used for training while the remaining half is used for testing. Suppose there are an equal number of positive and
negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected
error rate of the classifier on the test data?
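The expected error rate can also be checked empirically; the short simulation below is illustrative only, and the 50/50 label draw reflects the stated equal class proportions.

import random

random.seed(0)
n_test = 100_000
# test labels drawn with equal probability for "+" and "-"; since the
# classifier predicts "+" for everything, its errors are exactly the
# "-" records in the test set
labels = [random.choice("+-") for _ in range(n_test)]
errors = sum(lab == "-" for lab in labels)
print("estimated error rate:", errors / n_test)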
QUESTION 4 [Web Mining]
Given is the following table of linked web pages, which form a neighborhood graph.
Web page    Linked to the web page
A           C, E, F
B           C, F
C           A, D, F
D           B, C
E           A, D
F           C, E, G
G           A, B, C
(A) Find the best authority in the neighborhood graph using four iterations of the HITS (Hyperlink-Induced Topic Search)
algorithm.
(B) Find the best hub in the neighborhood graph using four iterations of the HITS algorithm.
To answer the above questions, you have to calculate and show α(ρ) and λ(ρ) for each web page ρ, where α(ρ) is the
authority weight of ρ and λ(ρ) is the hub weight of ρ, after the 4th iteration of the HITS algorithm. (There is no need to
consider normalization or convergence problems here.)
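For reference, here is a sketch of four unnormalized HITS iterations on the graph above. The update order (authorities from the previous hub weights, then hubs from the new authority weights) is one common convention; all variable names are ours.

links = {  # page -> pages it links to, per the table above
    "A": ["C", "E", "F"], "B": ["C", "F"], "C": ["A", "D", "F"],
    "D": ["B", "C"], "E": ["A", "D"], "F": ["C", "E", "G"],
    "G": ["A", "B", "C"],
}
pages = sorted(links)
auth = {p: 1.0 for p in pages}  # alpha(p), authority weights
hub = {p: 1.0 for p in pages}   # lambda(p), hub weights

for _ in range(4):
    # authority update: alpha(p) = sum of hub weights of pages linking to p
    auth = {p: sum(hub[q] for q in pages if p in links[q]) for p in pages}
    # hub update: lambda(p) = sum of authority weights of pages p links to
    hub = {p: sum(auth[q] for q in links[p]) for p in pages}

print("authority weights:", auth)
print("hub weights:", hub)
print("best authority:", max(pages, key=auth.get))
print("best hub:", max(pages, key=hub.get))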