CS 634 DATA MINING

QUESTION 1 [Time Series Data Mining]

(A) You are given two time series:

    Q = 0.2, 0.3, 0.2, 0.4, 0.1, 0.5
    C = 0.3, 0.4, 0.3, 0.1, 0.3

Compute the optimal (minimum) warping path distance between Q and C by drawing the distance table and the cumulative distance table. Specifically, refer to the distance table d and the cumulative distance table γ below. You need to fill out the values in the tables. What is the optimal (minimum) warping path distance you computed? Use the trace back procedure to find the optimal warping path.

    d(i, j) |  0.2   0.3   0.2   0.4   0.1   0.5
    --------+------------------------------------
      0.3   |
      0.4   |
      0.3   |
      0.1   |
      0.3   |

    γ(i, j) |  0.2   0.3   0.2   0.4   0.1   0.5
    --------+------------------------------------
      0.3   |
      0.4   |
      0.3   |
      0.1   |
      0.3   |

(B) Given are the following 4 time series:

    Q1 = 0.2, 0.3, 0.5, 0.4, 0.4, 0.7
    Q2 = 0.3, 0.4, 0.5, 0.55, 0.65, 0.7, 0.9
    Q3 = 0.1, 0.2, 0.25, 0.3, 0.4, 0.5, 0.4
    Q4 = 0.35, 0.25, 0.3, 0.4, 0.6, 0.55, 0.3

For every pair of time series, calculate the optimal warping path and the distance of the optimal warping path between the pair of time series. Note that, in this part (B), you do not need to draw the d and γ distance tables. However, to get full credit, you must show all the optimal warping paths and the distance values you computed. Then, using the agglomerative hierarchical clustering algorithm and the distance values you computed, show step by step how the agglomerative algorithm clusters the 4 time series above into 2 clusters and present the final clustering result.

QUESTION 2 [Association Rule Mining]

Consider the transaction database shown in the following table.

    Transaction ID | Items Bought
    ---------------+------------------
    T100           | A, B, C, D, E, F
    T200           | G, B, C, D, E, F
    T300           | A, H, D, E
    T400           | A, K, M, D, F
    T500           | M, B, D, N, E

Let minimum support be 60% and minimum confidence be 70%.

(A) Find all frequent itemsets using the Apriori algorithm. Show how the Apriori algorithm works in a step by step manner.

(B) List all the association rules found by the Apriori algorithm.

(C) Find all frequent itemsets using the FP-tree algorithm.
Show the final FP-tree you constructed. Note that the FP-tree algorithm has a pre-processing step, which sorts the items in a transaction based on the support values of the items. If two items have the same support value, they must be sorted in alphabetical (lexicographical) order in the transaction.

(D) List all the association rules found by the FP-tree algorithm.

(E) In this example, indicate whether the association rules produced by the Apriori algorithm are the same as those produced by the FP-tree algorithm.

QUESTION 3 [Classification and Prediction]

(A) Consider the following training set in which there are four attributes and two classes. Build the decision tree and generate all the classification rules from the tree. You have to show, step by step, how to calculate information gain, etc. during the construction of the tree.

    Outlook  | Temp (°F) | Humidity (%) | Windy? | Class
    ---------+-----------+--------------+--------+-----------
    sunny    | 75        | 70           | true   | Play
    sunny    | 80        | 90           | true   | Don't Play
    sunny    | 85        | 85           | false  | Don't Play
    sunny    | 72        | 95           | false  | Don't Play
    sunny    | 69        | 70           | false  | Play
    overcast | 72        | 90           | true   | Play
    overcast | 83        | 78           | false  | Play
    overcast | 64        | 65           | true   | Play
    overcast | 81        | 75           | false  | Play
    overcast | 55        | 85           | true   | Play
    rain     | 71        | 80           | true   | Don't Play
    rain     | 65        | 70           | true   | Don't Play
    rain     | 75        | 80           | false  | Play
    rain     | 68        | 80           | false  | Play
    rain     | 70        | 96           | false  | Play

(B) Consider the task of building a decision tree classifier from random data, where the attribute values are generated randomly irrespective of the class labels. Assume the data set contains records from two classes, “+” and “−”. Half of the data set is used for training while the remaining half is used for testing. Suppose there are an equal number of positive and negative records in the data and the decision tree classifier predicts every test record to be positive. What is the expected error rate of the classifier on the test data?

QUESTION 4 [Web Mining]

Given is the following table of linked web pages, which form a neighborhood graph.
    Web page | Linked to the web page
    ---------+-----------------------
    A        | C, E, F
    B        | C, F
    C        | A, D, F
    D        | B, C
    E        | A, D
    F        | C, E, G
    G        | A, B, C

(A) Find the best authority in the neighborhood graph using four iterations of the HITS (Hyperlink-Induced Topic Search) algorithm.

(B) Find the best hub in the neighborhood graph using four iterations of the HITS algorithm.

To answer the above questions, you have to calculate and show α(ρ) and λ(ρ) for each web page ρ, where α(ρ) is the authority weight of ρ and λ(ρ) is the hub weight of ρ, after the 4th iteration of the HITS algorithm. (There is no need to consider normalization or convergence problems here.)
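
The HITS update that Question 4 asks for can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not an official solution. It assumes that each row of the table lists the pages the first-column page links to (its out-links), that every weight starts at 1, and that each iteration updates all authority weights first and then computes the hub weights from the freshly updated authorities. The names `LINKS` and `hits` are made up for this example. If the table is instead read as listing in-links, the roles of α and λ swap.

```python
# Neighborhood graph from the table above.
# Assumption: page -> pages it links to (out-links).
LINKS = {
    "A": ["C", "E", "F"],
    "B": ["C", "F"],
    "C": ["A", "D", "F"],
    "D": ["B", "C"],
    "E": ["A", "D"],
    "F": ["C", "E", "G"],
    "G": ["A", "B", "C"],
}


def hits(links, iterations=4):
    """Run unnormalized HITS for a fixed number of iterations.

    Assumptions: all weights start at 1; within one iteration the
    authority weights are updated first, then the hub weights are
    computed from the new authority weights.
    """
    pages = sorted(links)
    authority = {p: 1 for p in pages}  # alpha(p)
    hub = {p: 1 for p in pages}        # lambda(p)
    for _ in range(iterations):
        # alpha(p) = sum of hub weights of the pages that link to p
        authority = {
            p: sum(hub[q] for q in pages if p in links[q]) for p in pages
        }
        # lambda(p) = sum of authority weights of the pages p links to
        hub = {p: sum(authority[q] for q in links[p]) for p in pages}
    return authority, hub


if __name__ == "__main__":
    authority, hub = hits(LINKS, iterations=4)
    for page in sorted(LINKS):
        print(page, "alpha =", authority[page], "lambda =", hub[page])
```

Because the question says normalization can be ignored, the sketch keeps raw integer weights. A different update order (for example, computing both weights from the previous iteration's values) would give different numbers, so the convention used in the course should be checked before comparing results.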