Download Concept Ontology for Text Classification

Concept Ontology for Text Classification Bhumika Thakker Outline        Introduction Techniques Used Shrinkage in a Hierarchy of Classes Classification using very few words Enhanced Word Clustering for Hierarchical Text Classification Classification of Web Content References Introduction An explosion in availability of online information with millions of documents on every topic easily accessible via the internet  The inability of people to assimilate and profitably utilize large amount of information is apparent due to the increase in available information  It becomes significant to organize this mass of information  Techniques for Classification Shrinkage in a Hierarchy of Classes  Classification using very few words  Word Clustering for Hierarchical Text Classification  Shrinkage in a Hierarchy of Classes    Technique leverages commonly available topic hierarchies in order to significantly improve classification accuracy Works well even when hierarchy is large and the training data set is sparse A method for exponentially reducing the amount of computation necessary for classification, sacrificing only a small amount of accuracy Shrinkage in a Hierarchy of Classes     The approach is to apply a technique from Statistics called shrinkage It provides improved estimates of parameters The technique exploits a hierarchy by “shrinking” parameter estimates in data sparse children toward the estimates of data-rich ancestors Employs simple form of shrinkage that creates new parameter estimates in a child by a linear interpolation of all hierarchy nodes from the child to the root Shrinkage in a Hierarchy of Classes Shrinkage for Text Classification     It is used to better estimate the probability of a word given a class θjt For each node in tree a maximum likelihood (ML) estimate based on the data associated with that node is constructed An improved estimate for each leaf node is then derived by “shrinking” its ML estimate towards the ML estimate of all its ancestors A unigram model for each node in the tree is build and smoothed each leaf model by linearly interpolating it with all the models found along the path to the root Cont…     The estimates along the path from the leaf to the root represent a tradeoff between specificity and reliability The estimate at the leaf is the most specific as it is based on the data from the topic alone Its least reliable as it is based on the least amount of data The estimator at the root is the most reliable but the least specific Cont…    Subtract each child’s data from its parent’s before calculating the parent’s ML estimate to ensure that the ML estimates along a given path are independent The estimate is based on the data that belongs to all the siblings of said child but not to the child itself Thus for any path from leaf to root every datum in the tree is used exactly one of the ML estimates providing both independence among the estimates and efficient use of the training data Determining the weights    Let be k estimates where is the estimate at the leaf, and k-1 is the depth of class cj in the tree The interpolation weights among the ancestors of the class cj are written where for the new estimate of the class conditioned word probabilities based on shrinkage is: The likelihood of data according to the mixture model is a convex function of weights and attains a single global maximum. This maximum for each leaf class cj, is calculated using the following procedure     The algorithm can be viewed as a particularly simple form of EM (Expectation maximization) Each datum is assumed to have been generated by first choosing one of the tree nodes in the path to the root say , using that estimate to generate the datum EM then maximizes the total likelihood when the choices of estimates made for the various data are unknown The first step in the iterative part is thus the E step and the second one is the M step      The method makes inefficient use of the available training data by carving off some of it to be used as a held-out set To overcome this problem the algorithm is modified as follows: All the available data is used both to construct the ML estimates and to optimize the weights As each document is used in the above algorithm, the ML estimates are modified to exclude its data so as to make them independent of it This method is known as “leave-one-out” or “jacknifing” Experimental Results      The following figure shows classification accuracy on the industry sector data set with 50-50 train-test splits while varying vocabulary size Larger vocabulary sizes generally perform better Hierarchical Feature Selection somewhat improves the performance of the flat naive Bayes in mid range of feature selection at about 5000 words Hierarchical feature selection accuracy reaches 64% Shrinkage improves classification accuracy of 74% Shrinkage helps more when training data is sparse     Following figure shows accuracy on the Newsgroups data set with full vocabulary and varying amount of training data Accuracy in this domain is highest with no feature selection for both classifiers with small amount of training data Shrinkage provides more improvement when the amount of training data is small and the shrinkage reduces variance in the classifications Estimates are improved by using shrinkage to smooth a class’s parameters with its ancestors    Following figure shows classification accuracy on Science hierarchy as a function of vocabulary size Best performance for both the classifier is equal Among 51 with less than 50 training documents shrinkage provides 6% improvement in accuracy, 39% for flat to 45% for hierarchical Pruning Tree for Increased computational Efficiency The class hierarchy can be leveraged to improve computational efficiency  The classifier can avoid calculating for a majority of the classes by pruning the tree dynamically during the classification of each document  Classification using very few words    Categorizing the different documents according to their topic, where topics are organized in a hierarchy of increasing specificity The bottleneck in this classification tasks is the need for a person to read each document and decide on its appropriate place in the hierarchy This is avoided by automatically classifying new documents using machine learning techniques      The approach is to divide the classification task into a set of smaller classification tasks Each corresponds to some split in classification hierarchy The key insight is that each subtask is significantly simpler than original task The classifier at each node in the hierarchy needs to be distinguished between small number of categories This is possible using small set of features The ability to restrict to a very small feature set avoids many of the difficulties  Such models are more robust and less subject to overfitting  Thus they achieve better accuracy even for a very simple classifier such as Naive Bayes      Key note is not merely the use of feature selection but its integration with the hierarchical structure Choosing small set of feature would not give good performance if used for classification for flattened class space In hierarchical approach any document only sees a small fraction of the features throughout the process The feature which it does see are divided so as to focus the attention of the classifier on the features relevant to the classification subtask at hand Probabilistic Framework   Constructing a hierarchical set of classifier, each based on its own set of relevant features It uses two main subroutines:  A feature selection algorithm for deciding on the appropriate feature set at each decision point  A supervised learning algorithm for constructing a classifier for that decision Bayesian Classifiers    A Bayesian network allows us to provide compact descriptions of complex distributions over a large number of random variables It uses directed acyclic graph to encode conditional independence assumptions about the domain Independent assumptions allow the distribution to be described as a product of small local interaction models     Bayesian classifier is simply a Bayesian network applied to a classification domain Contains a node C for the class variable and a node Xi for each of the features Specific instance x, the Bayesian network allows to compute the probability for each possible ck Bayes Optimal classification can be achieved by simply selecting class ck for which this probability is maximized Feature Selection    Feature selection employs Information Theoretic measures to determine a subset of the original domain features that seem to best capture the class distribution in the data For each feature Xi the algorithm determines the expected cross entropy: where is the set of all domain features except Xi It eliminates the feature Xi for which is minimized Feature Selection The feature eliminated least disrupts the original conditional class distribution  This process can be iterated to eliminate as many features as desired  In this respect the algorithm is very applicable to text domains with many features  Experimental Results Both hierarchical and flat classification schemes are ran on the datasets without employing any probabilistic feature selection  Original number of features in each data set and results are given in the following table   Two important phenomena are observed  In Hier1 dataset, the very large number of features used precludes the hierarchical scheme from performing better than simple flat method  In Hier2 dataset, the large number of features and small dataset size allows for more expressive KDB algorithm to overfit the training data  These results provide an empirical motivation for integration of feature selection The table results shows that hierarchical method clearly outperforms the flat classification method when considering a direct comparison of the 10 and 20 feature runs  Hierarchical method produces 8-41% fewer errors than flat methods for Hier1 and more modest relative gains for Hier2      The results show that the feature selection stage does serve to focus the algorithm on the features relevant to the local classification task The table shows the set of 10 features found to be most discriminating at each level of hierarchy learned for the Hier1 dataset At top level of the hierarchy, High level terms are selected from various major topics More specific words distinguishing the subtopics are seen Enhanced Word Clustering for Hierarchical Text Classification       Distributional clustering of words/features is one of the ways to reduce dimensionality Each word cluster can be treated as a single feature and thus dimensionality can be drastically reduced Feature clustering is more effective than feature selection especially at lower number of features. Feature clustering appears to preserve classification accuracy as compared to a full-feature classifier In case of small training sets and noisy features, word clustering can actually increase classification accuracy. The algorithms used are agglomerative in nature yielding suboptimal word clusters at a high computational cost DISTRIBUTIONAL WORD CLUSTERING     C be a discrete random variable that takes on values from the set of classes W be the random variable that ranges over the set of words The joint distribution p(C,W) can be estimated from the training set. Now suppose we cluster the words into k clusters W1,…,Wk. As Interested in reducing the number of features and the model size, we only look at “hard" clustering where each word belongs to exactly one word cluster, i.e     The random variable W range over the word clusters. The information about C captured by W can be measured by the mutual information I(C;W). Ideally, in formation of word clusters exact preservation of the mutual information is expected However clustering usually lowers mutual information Thus its essential to find a clustering that minimizes the decrease in mutual information, I(C;W) - I(C;W ) Classifying using word Clusters   The Naive Bayes method can be simply translated into using word clusters instead of words. This is done by estimating the new parameters p(Ws/ci) for word clusters similar to the word parameters p(wt/ci) in as   When estimates of p(wt/ci) for individual words are relatively poor, the corresponding word cluster parameters p(Ws/ci) provide more robust estimates resulting in higher classification scores The Naive Bayes rule (5) for classifying a test document d can be rewritten as   Figures 2 and 3 plot the fraction of mutual information lost against the number of clusters for both the divisive and agglomerative algorithms on the 20Ng and Dmoz data sets. It can be seen that less mutual information is lost with Divisive Clustering compared to Agglomerative Clustering at all number of clusters, though the difference is more pronounced at lower number of clusters      Figure 5 shows classification accuracies on the 20 Newsgroups data set for the algorithms considered. The horizontal axis indicates the number of features/clusters used in the classification model while the vertical axis indicates the percentage of test documents that were classified correctly Divisive Clustering (SVM as well as NB) achieves significantly better results at lower number of features than feature selection using Information Gain and Mutual Information With 50 clusters, Divisive Clustering (NB) achieves 78.05% accuracy The largest gain occurs when the number of clusters equals the number of classes         In Figure 6, the classification accuracy is plotted on 20Ng data using Naive Bayes when the training data is sparse. 2% of the available data is taken, that is 20 documents per class, for training and tested on the remaining 98% of the documents The results are averages of 5-10 trials Divisive Clustering obtains better results than Information Gain at all number of features. It also achieves a significant 12% increase over the maximum possible accuracy achieved by Information Gain. This is in contrast to Figure 5 where Information Gain eventually catches up as the number of features increases When the training data is small the word by class frequency matrix contains many zero entries. By clustering words, more robust estimates of word class probabilities are obtained which lead to higher classification accuracies Classification of Web Content    Utilizes known hierarchical structure, the classification problem is decomposed into a set of smaller problems corresponding to hierarchical splits in the tree Aims to first learn to distinguish among classes at the top level, then lower level distinctions are learned only within the appropriate top level of the tree. Each of these sub-problems can be solved much more efficiently, and more accurately as well Classification of Web Content     The use of a hierarchical decomposition of a classification problem allows for efficiencies in both learning and representation. Each sub-problem is smaller than the original problem It is sometimes possible to use a much smaller set of features for each The hierarchical structure can also be used to set the negative set for discriminative training and at classification time to combine information from different levels SVM for Web Content SVMs have been found to be efficient and effective for text classification, but not previously explored in the context of hierarchical classification.  The efficiency of SVMs for both initial learning and real-time classification make them applicable to large dynamic collections like web content  Hierarchical structure used in Web Content  Hierarchical structure is used for two purposes  Train second-level category models using different contrast sets (either within the same top-level category in the hierarchical case, or across all categories in the flat non hierarchical case).  Combine scores from the top- and second-level models using different combination rules, some requiring a threshold to be exceeded at the top level before second level comparisons are made. Classifying web search results    Classification techniques are used to automatically organize search results into existing hierarchical structures Classification models are learned offline using a training set of human-labeled documents and web categories Classification offers two advantages compared to clustering –   run time classification is very efficient manually generated category names are easily understood Constraints   To support goal of automatically classifying web search results two constraints  Use of just the short summaries returned from web search engines. since it takes too long to retrieve the full text of pages in a networked environment. These automatically generated summaries are much shorter than the texts used in most classification experiments, and they are much noisier than other document surrogates like abstracts that some have worked with.  Focus on the top levels of the hierarchy since we believe that many search results can be usefully disambiguated at this level. Develop an interface that tightly couples search results and category structure are found to have large preference and performance advantages for automatically classified search results TEXT CLASSIFICATION USING SVMs     Text classification involves a training phase and a testing phase. In the training phase, a large set of web pages with known category labels are used to train a classifier. An initial model is built using a subset of the labeled data, and a holdout set is used to identify optimal model parameters. During the testing or operational phase, the learned classifier is used to classify new web pages.    A support vector machine (SVM) algorithm was used as the classifier A linear SVM is a hyperplane that separates a set of positive examples from a set of negative examples with maximum margin. The margin is the distance from the hyperplane to the nearest of the positive and negative examples.     In the linearly separable case maximizing the margin can be expressed as an optimization problem In cases where points are not linearly separable, slack variables are introduced that permit, but penalize, points that fall on the wrong side of the decision boundary Problems that are not linearly separable, kernel methods can be used to transform the input space so that some non-linear problems can be learned. The simplest linear form of the SVM can be used because it provided good classification accuracy, and is fast to learn and apply Feature Selection in Web Content    Reduction in the feature space by eliminating words that appear in only a single document then selecting the 1000 words with highest mutual information with each category. The mutual information MI(F, C) between a feature, F, and a category, C, is defined as: Compute the mutual information between every pair of features and categories SVM Parameters  In addition to varying the number of features, SVM performance is governed by two parameters  C (the penalty imposed on training examples that fall on the wrong side of the decision boundary)  p (the decision threshold)      The Default C parameter value (0.01) is used The decision threshold, p, can be set to control precision and recall for different tasks. Increasing p, results in fewer test items meeting the criterion, and this usually increases precision but decreases recall. Conversely, decreasing p typically decreases precision but increases recall. P is chosen so as to optimize performance on the F1 measure on a training validation set. Results       Decision thresholds were established on a training validation set For each category, if a test item exceeds the decision threshold, it is judged to be in the category A test item can be in zero, one, or more than one categories From this precision (P) and recall (R) are computed These are micro-averaged to weight the contribution of each category by the number of test examples in it For each test example, the probability of it being in each of the 13 top-level categories and each of the 150 second-level categories is computed References     Improving Text Classication by Shrinkage in a Hierarchy of Classes by Andrew McCallum, Ronald Rosenfeld, Tom Mitchell, Andrew Y. Ng http://www.cs.cmu.edu/~mccallum/papers/hier-icml98.ps.gz Hierarchically classifying documents using very few words by Daphne Koller and Mehran Sahami http://dbpubs.stanford.edu:8090/pub/showDoc.Fulltext?lang=en&do c=1997-75&format=pdf&compression=&name=1997-75.pdf Hierarchical Classification of Web Content by Susan Dumais and Hao Chen http://research.microsoft.com/~sdumais/sigir00.pdf Enhanced Word Clustering for Hierarchical Text Classification by Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar http://www.cs.utexas.edu/users/inderjit/public_papers/hierdist.pdf THANK YOU

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Concept Ontology for Text Classification