Data Mining
Chapter 5: A Systematic Overview of Data Mining Algorithms
Fall 2011
Ming Li
Department of Computer Science and Technology, Nanjing University

Data mining algorithm components
• Mining task: the purpose of the data mining.
• Structure: the functional form of the model or pattern that is fitted to the data.
• Score function: judges the quality of a fitted model or pattern.
• Optimization/search method: searches over parameters or structures to find the one with the best quality w.r.t. the score function.
• Data management technique: handles data access efficiently during the mining process.

Why adopt such a systematic view?
Such a systematic (reductionist) point of view emphasizes the fundamental properties of an algorithm and avoids the tendency to think in terms of lists of algorithms.

Algorithm = {model structure, score function, search method, data management technique}

• Analysis: clarifies the role of each component and makes it easier to compare competing algorithms.
• Synthesis: builds novel data mining algorithms with different properties by combining different components in various ways.

Case studies

Component       | CART                          | BP Neural Network              | Apriori                    | Vector-Space
Task            | Classification & regression   | Classification & regression    | Rule pattern discovery     | Content-based retrieval
Structure       | Decision tree                 | Neural network                 | Association rules          | Vectors of term occurrences
Score function  | Cross-validated loss function | Squared error                  | Support / confidence       | Angle between two vectors
Search method   | Greedy search over structures | Gradient descent on parameters | Breadth-first with pruning | Various techniques
Data management | Unspecified                   | Unspecified                    | Linear scans               | Various fast indexing strategies

Case 1: CART – overview
CART constructs a decision tree from the data.
General procedure:
1. Construct a root node that contains the whole data set.
2. Select the attribute that benefits the task most, according to some criterion computed from the data in the current node (e.g., the Gini index for CART, the gain ratio for C4.5).
3. Split the examples in the current node into subsets according to the values of the selected attribute.
4. For each subset, create a child of the current node and pass the subset's examples to it.
5. Recursively repeat steps 2–4 until some stopping criterion is met (e.g., a minimum number of examples per node, or no further gain in predictive ability).
6. Prune the tree afterwards; this is usually required to avoid overfitting.

Case 1: CART – model structure
The model structure used in CART is a decision tree, a flow-chart-like tree structure in which:
– each internal node is a test on an attribute,
– each branch represents an outcome of the test,
– each leaf holds a class (or class distribution) or a predicted value.
The path from the root to a leaf leads an unseen instance to its prediction.

Case 1: CART – score function
The misclassification error (classification) or the mean squared error (regression) is used as the score function for a given instantiation of the model structure (i.e., a particular classification or regression tree):

Regression:      MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²
Classification:  error = (1/n) Σᵢ I(yᵢ ≠ ŷᵢ)

Cross-validation is used to estimate this misclassification error / MSE.

Case 1: CART – search method
Two-phase greedy search over tree structures of different complexity:
• Growing: recursively expand the tree from the root node into a large tree, greedily selecting the best splitting attribute for the data that arrive at the current node.
• Pruning: recursively prune back specific branches of this large tree (in order to avoid overfitting).

Case 1: CART – data management technique
• Unspecified.
• Assumption in CART: all data reside in main memory.
• Any data management technique that facilitates the learning process can be used.
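To make the growing phase concrete, here is a minimal sketch of Gini-based split selection in Python (the split criterion CART uses). The function names and the dict-per-example data layout are illustrative assumptions, not CART's actual implementation:

    from collections import Counter

    def gini(labels):
        """Gini index of a collection of class labels: 1 - sum_k p_k^2."""
        n = len(labels)
        if n == 0:
            return 0.0
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def best_split(examples, labels, attributes):
        """Greedily pick the attribute whose split minimizes the
        weighted Gini index of the resulting child nodes."""
        n = len(examples)
        best_attr, best_score = None, float("inf")
        for a in attributes:
            parts = {}                      # attribute value -> child labels
            for x, y in zip(examples, labels):
                parts.setdefault(x[a], []).append(y)
            score = sum(len(ys) / n * gini(ys) for ys in parts.values())
            if score < best_score:
                best_attr, best_score = a, score
        return best_attr, best_score

Growing then amounts to calling best_split at each node, partitioning the examples by the chosen attribute, and recursing until the stopping criterion is met.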
Case 2: Backpropagation – overview
• Backpropagation (BP) is a common method for teaching artificial neural networks how to perform a given task.
• General procedure:
1. Present a training sample to the neural network.
2. Compare the network's output to the desired output for that sample, and calculate the error in each output neuron.
3. For each neuron, calculate what the output should have been, and a scaling factor indicating how much lower or higher the output must be adjusted to match the desired output. This is the local error.
4. Adjust the weights of each neuron to lower the local error.
5. Assign "blame" for the local error to the neurons at the previous layer, giving greater responsibility to neurons connected by stronger weights.
6. Repeat from step 3 on the neurons at the previous layer, using each one's "blame" as its error.

Case 2: BP – model structure
The model structure used in BP is a multi-layer perceptron: a non-linear transformation of weighted sums of the inputs.
• The multi-layer perceptron is a type of feed-forward neural network:
– each unit is connected only to units in the next layer,
– adjacent layers are fully connected.
• The input layer takes the attribute values as input signals, and the units of the output layer give the prediction.
• The hidden layer processes the input signals and passes intermediate results on to the output layer.
[Figure: a multi-layer perceptron with an input layer, one hidden layer, and an output layer, connected by weights.]
A multi-layer perceptron with one hidden layer can approximate any continuous function to arbitrary accuracy.

Case 2: BP – score function
• The sum of squared errors (squared loss) is used as the score function.
– The SSE score function is widely used in BP.
– In fact, any differentiable function can replace it (e.g., the log-likelihood).
– In classification, it is an upper bound on the misclassification (0-1) error.

Case 2: BP – search method
Greedy search over the model parameters, with the model complexity fixed.
• Bridging the model instantiation and the score function:
– The model parameters are all the weights connecting units in different layers.
– The output is a non-linear transformation of weighted sums of the inputs, computed layer by layer from the input layer to the output layer.
– Plugging this output into the score function yields a new non-linear function whose variables are the model parameters.
• Gradient descent over this augmented score function is used to find the parameter values that minimize the score function.

Case 2: BP – data management technique
• Unspecified.
• There are two different ways to train a multi-layer perceptron with BP:
– online (updating after each example),
– batch (updating after a pass over the whole training set).
• Any data management technique that facilitates these two training strategies can be chosen.
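As a concrete illustration of the search method, here is a minimal sketch of batch gradient descent on the squared error for a one-hidden-layer perceptron with sigmoid units. The names (W1, W2, lr) and hyperparameter values are illustrative assumptions, and biases are omitted for brevity (a constant-1 input column can stand in for them):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_mlp(X, y, n_hidden=8, lr=0.5, epochs=2000, seed=0):
        """X: (n, d) inputs; y: (n, 1) targets in [0, 1]."""
        rng = np.random.default_rng(seed)
        n, d = X.shape
        W1 = rng.normal(scale=0.5, size=(d, n_hidden))  # input -> hidden
        W2 = rng.normal(scale=0.5, size=(n_hidden, 1))  # hidden -> output
        for _ in range(epochs):
            # Forward pass: non-linear transformations of weighted sums,
            # computed layer by layer from input to output.
            H = sigmoid(X @ W1)
            out = sigmoid(H @ W2)
            # Backward pass: local error at the output units, then "blame"
            # propagated back to the hidden units through the weights.
            delta2 = (out - y) * out * (1 - out)
            delta1 = (delta2 @ W2.T) * H * (1 - H)
            # Gradient descent step on all weights (the model parameters).
            W2 -= lr * (H.T @ delta2) / n
            W1 -= lr * (X.T @ delta1) / n
        return W1, W2

Each epoch here is one batch update over the whole matrix X; updating after each individual example instead gives the online variant mentioned above.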
Case 3: Apriori – overview
Apriori is a popular algorithm that discovers association rules in a data set.
General procedure:
1. Identify all frequent itemsets.
2. Generate strong association rules by splitting each frequent itemset into two subsets and placing one subset on each side of "⇒", such that the rule is both "accurate" and "frequently observed".

Case 3: Apriori – model structure
The model structure used in Apriori is a set of association rules, each annotated with its support and confidence:

A = 1 AND B = 1 ⇒ C = 1 [ps, pc]

• Support (ps): how frequently the data items satisfying the precondition of the rule appear in the database.
• Confidence (pc): how likely the discovered rule is to be "correct".

Case 3: Apriori – score function
• A 0-1 score function is used: a pattern gets a score of 1 if it satisfies both threshold conditions, and a score of 0 otherwise.
– The score function is computed for each rule.
– The support and confidence are estimated over the data set.

Case 3: Apriori – search method
Breadth-first search with pruning over the rule structures (in fact, over the frequent itemsets).
• Breadth-first search:
– Find all frequent itemsets containing 2 items.
– Generate the candidate frequent itemsets containing 3 items from the frequent 2-itemsets, and so on.
– Repeat until no new frequent itemsets are found.
Without pruning, this search has exponential complexity!
• Pruning:
– Key idea: any non-empty subset of a frequent k-itemset is itself a frequent itemset.
– Do not generate a candidate k-itemset if any of its subsets is not frequent.
– Since data are usually sparse, the pruning can be very effective.

Case 3: Apriori – data management technique
• Multiple linear scans of the database:
– To find the frequent k-itemsets (k = 2, 3, …), a linear scan of the records in the database is required to compute the support of each candidate itemset.
– To generate rules from the identified frequent itemsets, a linear scan of the records in the database is required to compute the confidence of each resulting association rule.

Case 4: Vector-space algorithm
• Task: retrieval of the k documents in the database most similar to the query.
• Model structure: vectors of term occurrences.
• Score function: cosine similarity (in fact, any similarity measure that reflects the user's notion of relevance can be used).
• Search method: various methods.
• Data management technique: various fast indexing strategies.

In summary
• In the reductionist (component-based) point of view:
– Data mining algorithms are combinations of different components.
– All components are essential.
– The relative importance of each component varies from problem to problem and from task to task.
• Such a point of view offers a platform for:
– systematically describing data mining algorithms (a common representation format),
– systematically comparing data mining algorithms (with respect to a given component),
– systematically designing novel data mining algorithms for specific problems (by assembling different components).
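The equation "Algorithm = {model structure, score function, search method, data management technique}" can also be read as a recipe for code. Below is a minimal sketch of that composition; MiningAlgorithm and its fields are hypothetical names chosen for illustration, not an existing API:

    from dataclasses import dataclass
    from typing import Any, Callable, Iterable

    @dataclass
    class MiningAlgorithm:
        structure: Any                           # e.g., tree, MLP, rule set
        score: Callable[[Any, Iterable], float]  # judges a fitted model on data
        search: Callable[..., Any]               # optimizes structure/parameters
        data_access: Callable[[], Iterable]      # how records are scanned/indexed

        def fit(self) -> Any:
            # The search method optimizes the structure w.r.t. the score
            # function, reading data through the data management component.
            return self.search(self.structure, self.score, self.data_access)

Under this view, CART and BP differ only in which components are plugged in: roughly {decision tree, cross-validated loss, greedy grow-and-prune, in-memory scans} versus {multi-layer perceptron, squared error, gradient descent, online or batch access}.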