MINING MULTI-LABEL DATA
BY GRIGORIOS TSOUMAKAS, IOANNIS KATAKIS, AND IOANNIS VLAHAVAS
Published on July 7, 2010
Team Members: Kristopher Tadlock, Jimmy Sit, Kevin Mack

BACKGROUND AND PROBLEM DEFINITION
• “A large body of research in supervised learning deals with the analysis of single label data, where training examples are associated with a single label l from a set of disjoint labels L. However, training examples in several application domains are often associated with a set of labels Y ⊆ L. Such data are called multi-label” (Tsoumakas et al.).
• Application: ranking web pages. Web pages often carry several labels. For example, “cooking”, “food network”, and “iron chef” might all apply to the same page. How do you rank and classify that page alongside other pages that share some, but not all, of the same labels?

TECHNICAL HIGHLIGHTS OF PROBLEM SOLVING
• Problem Transformation: Divide the problem into several single-label problems and solve them using known algorithms.
• Algorithm Adaptation: Change an existing algorithm so you can use it on a multi-label problem.
• Dimensionality Reduction: Reduce the number of random variables (features) in the data set, or reduce the number of dimensions in the labels. The goal is to remove noise so you can focus on the relationships that matter.
• Evaluation Measures: How accurately does your model classify examples?
  • Ex. How often are labels misclassified? How often does a less important label get a higher rank than a more important label?

ILLUSTRATION OF METHODS INTRODUCED
• Problem Transformation: Divide the problem into several single-label problems and solve them using known algorithms.
  • Label Powerset: Treats each distinct set of labels as if it were a single label, then ranks the label sets by support, highest first. Similar to the first step of Apriori.
  • Binary Relevance: Trains one +/- classifier per label. If an instance has that label it gets a +; if not, it gets a -. Similar to 1R (one rule).
• Algorithm Adaptation: Change an existing algorithm so you can use it on a multi-label problem.
  • Adapted C4.5 Tree: Multiple labels are allowed at the leaves, and splits are chosen so as to reduce a multi-label entropy:
    Entropy(S) = −Σ_j [ p(l_j) log p(l_j) + q(l_j) log q(l_j) ]
    where p(l_j) = relative frequency of class l_j and q(l_j) = 1 − p(l_j).
  • ML-kNN: As in regular kNN, find the k nearest neighbors, then aggregate their label sets. ML-kNN aggregates using the maximum a posteriori principle, which combines what can be known about the data set without learning (prior probabilities) and after learning (posterior probabilities).

DIMENSIONALITY REDUCTION TECHNIQUES
• Feature Selection: Select a subset of the dimensions for some purpose. Ex. To minimize a loss function.
  • Wrapper: A guided search over feature subsets, selecting the subset that scores best on some criterion.
  • Filter: Look for something specific in the data set, independent of the learner. Ex. An informed search based on the result of LP (Label Powerset) learning.
• Feature Extraction: Transform the data set into a lower-dimensional data set using some algorithm or reasoning. Uses various statistical and linear algebra techniques.
• Exploiting Label Structure: Create a general-to-specific tree of labels. An example cannot be associated with a label (some leaf node) without also being associated with its parent nodes. Create the relationship by tracing the path from root to leaf.

SCALING UP PROBLEMS
If you are analyzing a complicated data set with thousands of labels, you can run into problems.
• The number of training examples for each label is significantly smaller than the total number of examples, and there are far more possible combinations of labels than combinations that actually occur. The output is too complex and not helpful.
• Problem complexity and performance: It can take too much memory and/or processing time to classify everything.
• One technique for reducing complexity is to use a hierarchy tree, such as HOMER (Hierarchy Of Multi-label classifiERs). Each level works on a subset of the labeled data set. It uses predictive algorithms to decide how to divide up the data set as you descend the tree.

MULTI LABEL DATA MINING SOFTWARE
• BoosTexter
• Matlab
• Mulan

TEAM’S OPINION ON THE METHOD/RESEARCH WORK
• Thorough coverage of the spectrum of techniques and considerations in multi-label data mining.
• Great place to discover new techniques and algorithms.
• Didn’t go into much detail on any one algorithm. Seemed best suited as a resource or an introduction.
• Didn’t compare or contrast methods. We weren’t sure when to use problem transformation, algorithm adaptation, or dimensionality reduction.
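
ILLUSTRATIVE SKETCH: PROBLEM TRANSFORMATION AND EVALUATION
The two problem transformation methods described earlier (Binary Relevance and Label Powerset) and one simple evaluation measure can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not code from the paper or from any of the tools listed: the toy data set, the label names, and the function names binary_relevance, label_powerset, and hamming_loss are all hypothetical. Hamming loss is used here as one common way to count how often labels are misclassified.

```python
from collections import Counter

# Hypothetical toy data: each example is (feature vector, set of labels).
DATA = [
    ([1.0, 0.0], {"cooking", "iron chef"}),
    ([0.9, 0.1], {"cooking", "food network"}),
    ([0.0, 1.0], {"sports"}),
    ([1.0, 0.2], {"cooking", "iron chef"}),
]
LABELS = sorted({label for _, ys in DATA for label in ys})


def binary_relevance(data, labels):
    """Build one single-label (+/-) data set per label.

    True stands for "+" (the example has the label), False for "-".
    """
    return {
        label: [(x, label in ys) for x, ys in data]
        for label in labels
    }


def label_powerset(data):
    """Treat each distinct label set as one class of a single-label problem."""
    return [(x, frozenset(ys)) for x, ys in data]


def hamming_loss(true_sets, predicted_sets, labels):
    """Fraction of (example, label) pairs that are misclassified."""
    # The symmetric difference holds labels that are missing or wrongly added.
    errors = sum(len(t ^ p) for t, p in zip(true_sets, predicted_sets))
    return errors / (len(true_sets) * len(labels))


br = binary_relevance(DATA, LABELS)   # one +/- data set per label
lp = label_powerset(DATA)             # one combined class per example
support = Counter(y for _, y in lp)   # how often each label set occurs
```

Each list in br can be handed to any ordinary binary classifier, while lp can be handed to any ordinary multi-class classifier; support gives the counts used to rank label sets. The trade-off is that Binary Relevance ignores relationships between labels, whereas Label Powerset can produce a very large number of classes when there are many labels.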