NLP for Text Mining
Towards systems capable of extracting semantic information from texts
Presentation by: Tiago Vaz Maia

Introduction
- Keyword models of text are very poor (e.g. search with Google).
- There is great advantage to a system that 'understands' texts, at some level.
- Need for semantic understanding.

CRYSTAL (UMass)
"CRYSTAL: Inducing a Conceptual Dictionary"
S. Soderland, D. Fisher, J. Aseltine, W. Lehnert
In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1995.

Information Extraction Systems
- Generally domain-specific (e.g. medical records, financial news).
- Work by having "a dictionary of linguistic patterns that can be used to identify references to relevant information in a text". (Op. cit.)

A 'Concept-Node' Definition
- CN-type: Sign or Symptom
- Subtype: Absent
- Extract from Direct Object
- Active voice verb
- Subject constraints: words include "PATIENT"; head class <Patient or Disabled Group>
- Verb constraints: words include "DENIES"
- Direct Object constraints: head class <Sign or Symptom>
(Op. cit.)

Rationale for CRYSTAL
- Building domain-specific dictionaries is time-consuming (the knowledge-engineering bottleneck).
- CRYSTAL builds such dictionaries automatically, from annotated texts (supervised learning).

Annotation of Texts
- E.g. "Unremarkable with the exception of mild shortness of breath and chronically swollen ankles."
- A domain expert marks "shortness of breath" and "swollen ankles" with CN type "sign or symptom" and subtype "present". (Example from op. cit.)

CRYSTAL's Output
- A dictionary of information extraction rules (i.e. concept-nodes) specific to the domain.
- These rules should be general enough to apply to other texts in the same domain.

Algorithms
- Next time... Five minutes go by very fast!

Conclusions
- Domain-specific information extraction systems capable of semantic understanding are within reach of current technology.
- CRYSTAL makes such systems scalable and easily portable by automating the process of dictionary construction.

CRYSTAL's Dictionary Induction Algorithms
Tiago V. Maia

Review: Information Extraction
- Information extraction systems extract useful information from free text.
- This information can, for example, be stored in a database, where it can be data-mined, etc.
- E.g., from hospital discharge reports we may want to extract:
  - Name of patient.
  - Diagnosis.
  - Symptoms.
  - Prescribed treatments.
  - Etc.

Review: Dictionary of Concept-Nodes
- In the UMass system, extraction rules are stored in concept-nodes.
- The set of all concept-nodes is called a dictionary.

A 'Concept-Node' Definition
- CN-type: Sign or Symptom
- Subtype: Absent
- Extract from Direct Object
- Active voice verb
- Subject constraints: words include "PATIENT"; head class <Patient or Disabled Group>
- Verb constraints: words include "DENIES"
- Direct Object constraints: head class <Sign or Symptom>
(Op. cit.)

Review: CRYSTAL
- CRYSTAL automatically induces a domain-specific dictionary from an annotated training corpus.

Learning Algorithm
Two steps:
1. Create one concept-node per positive training instance.
2. Gradually merge different concept-nodes to achieve a more general and compact representation.

Constructing Initial CNs
Example of step 1:
- Sentence: "Unremarkable with the exception of mild shortness of breath and chronically swollen ankles."
- Annotation: "shortness of breath" and "swollen ankles" are marked with CN type "sign or symptom", subtype "present".

Initial CN Definition
- CN-type: Sign or Symptom
- Subtype: Present
- Extract from Prep. Phrase "WITH"
- Verb = <NULL>
- Subject constraints: words include "UNREMARKABLE"
- Prep. Phrase constraints: words include "THE EXCEPTION OF MILD SHORTNESS OF BREATH AND CHRONICALLY SWOLLEN ANKLES"; head class <Sign or Symptom>
(Op. cit.)

Need for Induction
- Initial concept-nodes are too specific to be useful for any texts other than the training corpus.
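The two steps just outlined can be sketched as follows. This is a rough illustration with invented data structures and helper names, not CRYSTAL's actual representation; step 2 is shown only for word constraints, which are relaxed by intersection so the merged node covers both inputs.

```python
def initial_cn(cn_type, subtype, extract_from, constraints):
    """Step 1: build one maximally specific concept-node per annotated
    instance; `constraints` maps a syntactic slot to its exact words."""
    return {"cn_type": cn_type, "subtype": subtype,
            "extract_from": extract_from,
            "constraints": {slot: set(words)
                            for slot, words in constraints.items()}}

def merge_cns(a, b):
    """Step 2 (word constraints only): relax each slot's word constraint
    to the intersection, so the merged CN covers both inputs."""
    assert a["cn_type"] == b["cn_type"] and a["subtype"] == b["subtype"]
    slots = set(a["constraints"]) | set(b["constraints"])
    return {**a, "constraints": {
        slot: a["constraints"].get(slot, set()) & b["constraints"].get(slot, set())
        for slot in slots}}

cn1 = initial_cn("Sign or Symptom", "Absent", "Direct Object",
                 {"verb": ["VEHEMENTLY", "DENIES"]})
cn2 = initial_cn("Sign or Symptom", "Absent", "Direct Object",
                 {"verb": ["DENIES", "CATEGORICALLY"]})
merged = merge_cns(cn1, cn2)
# The merged verb constraint keeps only the shared word "DENIES".
```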
- One needs an inductive step, capable of constructing more general definitions.

Step 2: Inducing General CNs
- Main idea: merge sufficiently similar CNs, until further merges start generating too many errors.
- How do we merge similar CNs? The goal is to obtain a general CN that 'covers' both CNs and provides a good generalization for unseen cases.

Merging CNs
- The unification of two CNs is found by relaxing the constraints in such a way that they cover both nodes.
- Word constraints: take the intersection of the word constraints from each CN. E.g., for verb constraints:
  "vehemently denies" + "denies categorically" -> "denies"

Merging CNs (cont.)
- Semantic class constraints: found by moving up the semantic hierarchy. E.g., for prep. phrase constraints:
  head class <Sign or Symptom> + head class <Lab or Test Result> -> head class <Finding>,
  if in the semantic hierarchy <Finding> is the parent of both <Sign or Symptom> and <Lab or Test Result>.

Evaluating Merges
- Every merged CN is tested against the training corpus. If its error rate is above a certain threshold, it is discarded.
- The system continues merging CNs until no more can be merged without producing a CN whose error rate exceeds the prespecified tolerance.

Results
- MUC-3: dictionary built using 1500 hours of work by two advanced graduate students and one post-doc.
- MUC-4: using AutoSlog, a precursor of CRYSTAL, the dictionary was built using 8 hours of work by a first-year graduate student!
- Both dictionaries presented roughly the same functionality.

Conclusions
- Automated induction of domain-specific information extraction dictionaries is a very good alternative to hand-coding.
- The knowledge-engineering effort is drastically reduced, allowing for widespread real-world applications.

Combining Information Extraction and Data Mining
Tiago V. Maia
"Using Information Extraction to Aid the Discovery of Prediction Rules from Text"
U.Y. Nahm and R.J. Mooney
In: KDD-2000 Workshop on Text Mining

An Approach to Text Mining
- Text -> IE -> DB -> KDD -> Rules

The Application
- Step 1:
  Starting from free-text job postings in a newsgroup, build a database of jobs.
- Step 2: Mine that database to find interesting rules.

Sample Job Posting
- "Leading Internet Provider using cutting edge web technology in Austin is accepting applications for a Senior Software Developer. The candidate must have 5 years of software development, which includes coding in C/C++ and experience with databases (Oracle, Sybase, Informix, etc.)..."

Sample Job Record
- Title: Senior Software Developer
- Salary: $70-85K
- City: Austin
- Language: Perl, C, JavaScript, Java, C++
- Platform: Windows
- Application: Oracle, Informix, Sybase
- Area: RDBMS, Internet, Intranet, E-commerce
- Required years of experience: 5
- Required degree: BS

Sample Extracted Rule
- "If a computer-related job requires knowledge of Java and graphics, then it also requires knowledge of Photoshop."

Information Extraction
- Uses RAPIER: a system similar to CRYSTAL, which also constructs the extraction rules automatically, from an annotated training corpus.

Rule Induction
- The induced rules predict the value in a database field, given the values in the rest of the record.
- Each slot-value pair is treated as a distinct binary feature, e.g. "Oracle ∈ Application".
- An example of an induced rule:
  HTML ∈ Language and WindowsNT ∈ Platform and Active Server Pages ∈ Application -> Database ∈ Area

Algorithms for Rule Induction
- Uses C4.5.
- Decision trees are learned using the binary representation for slot-value pairs, and pruned to yield the rules.

Conclusions
- Text mining can be achieved by information extraction followed by the application of standard KDD techniques to the resulting structured database.
- Both IE and KDD are well understood, and their combination should yield practical real-world systems.

Learning Probabilistic Relational Models
Getoor, L., Friedman, N., Koller, D. and Pfeffer, A.
Invited contribution to the book Relational Data Mining, Dzeroski, S. and Lavrac, N. (Eds.), Springer-Verlag, 2001.
Applicability
- Text -> IE -> DB -> KDD -> JPD

Probabilistic Relational Models
- Probabilistic relational models are a marriage of:
  1. Probabilistic graphical models, and
  2. Relational models.

Why a Probabilistic Model?
- Probabilistic graphical models (e.g. Bayesian networks) have proven very successful for representing statistical patterns in data.
- Algorithms have been developed for learning such models from data.
- Because what is learnt is a joint probability distribution, these models are not restricted to answering questions about specific attributes.

Why a Relational Model?
- Typically, Bayesian networks (or graphical models in general) use a representation that consists of several variables (e.g. height, weight, etc.) but has no structure.
- The most common way of structuring data is in relational form (e.g. relational databases).
- Data structured in this way consists of individuals, their attributes, and relations between individuals (e.g. a database of students, classes, etc.).

Probabilistic Relational Models (cont.)
- Probabilistic relational models are a natural way to represent and learn statistical regularities in structured information.
- Moreover, because they are close to relational databases, they are ideal for data mining.

Introduction to Bayes Nets
- The problem with using joint probability distributions is their exponential size in the general case.
- E.g., assume that there are four random variables: Student Intelligence, Course Difficulty, Student Understands Material, Student Grade.
- Assume Student Intelligence, Course Difficulty and Student Understands Material each have three possible values: {low, medium, high}.

Exponential Complexity of JPDs
- Further assume that Student Grade has six possible values: {A, B, C, D, E, F}.
- Then, to specify (or learn) the full joint probability distribution, we need 3 x 3 x 3 x 6 = 162 values.
- Imagine if we had hundreds of variables...
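The 162 figure is simply the product of the variables' cardinalities; a minimal computation (variable names taken from the example above) shows how quickly the full table grows:

```python
from math import prod

cardinalities = {"Intelligence": 3, "Difficulty": 3,
                 "Understands Material": 3, "Grade": 6}

# A full JPD needs one entry per joint assignment of all four variables.
jpd_entries = prod(cardinalities.values())
print(jpd_entries)  # 3 * 3 * 3 * 6 = 162

# With 100 three-valued variables, the table already has 3**100
# (roughly 5e47) entries: hopelessly large to specify or learn.
large_table = 3 ** 100
```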
Independence Assumptions in Bayes Nets
- Bayes nets help because they exploit the fact that each variable typically depends directly on only a small number of other variables. In our example, we have:

Example of a Bayes Net
- [Figure: Intelligence -> Understands Material <- Difficulty, and Understands Material -> Grade.]

Conditional Independence
- That is, for example, the difficulty of the course is independent of the intelligence of the student.
- Also, importantly, the grade is independent of the intelligence of the student and of the difficulty of the course, given the student's understanding of the material: conditional independence.

Conditional Independence (cont.)
- Formally: every node X is independent of every node that is not a descendant of X, given the values of X's parents.

Conditional Probability Distributions (CPDs)
- Associated with each node is a conditional probability distribution, which specifies the probability distribution for the node given the values of its parents.

CPD for the Understanding Node
  Intelligence  Difficulty | P(U=low)  P(U=medium)  P(U=high)
  low           low        |   0.20      0.60         0.20
  low           medium     |   0.45      0.40         0.15
  low           high       |   0.75      0.20         0.05
  medium        low        |   0.15      0.30         0.55
  medium        medium     |   0.20      0.60         0.20
  medium        high       |   0.45      0.40         0.15
  high          low        |   0.01      0.19         0.80
  high          medium     |   0.10      0.35         0.55
  high          high       |   0.20      0.60         0.20

Savings in Complexity
- For our example, we need to specify the following values:
  - 3 for the probabilities of the 3 possible values of Intelligence.
  - 3 for the probabilities of the 3 possible values of Difficulty.
  - 27 for the CPD of the Understanding node.
  - 18 for the CPD of the Grade node.
- Total: 51 (instead of 162).

Savings in Complexity (cont.)
- In cases with hundreds or thousands of variables, the fact that each variable depends on only a small number of other variables yields significantly more dramatic savings.

Joint Probability Distribution
- Given the CPDs, we can calculate the JPD via the chain rule for Bayesian networks.
In our example, that is:
  P(Intelligence, Difficulty, Understanding, Grade) =
    P(Intelligence) x P(Difficulty) x P(Understanding | Intelligence, Difficulty) x P(Grade | Understanding)

Queries
- Once we have the joint probability distribution, we can answer any query about the probability distribution of some variables, given any set of observed variables.

Bayesian Network Classifiers
Friedman, N., Geiger, D., and Goldszmidt, M.
In: Machine Learning, 29:131-163, 1997.
Presentation by: Tiago Vaz Maia

Learning Bayes Nets
- The problem: given a training set of N examples over M variables,
    x_1^1, x_2^1, ..., x_M^1
    x_1^2, x_2^2, ..., x_M^2
    ...
    x_1^N, x_2^N, ..., x_M^N
  (where x_i^j is the value of variable i in example j), find the network that best matches this training set.

Learning Bayes Nets (cont.)
- Two separable problems: learning the structure of the network, and learning the parameters.
- Learning the parameters is easy. The hard part is learning the structure.

Learning the Structure of Bayes Nets
- The general approach is to have a score that evaluates the match between the network and the training set, and to search for the best-matching network.
- We will concentrate on a score based on minimum description length (MDL).

MDL
- MDL(B|D) = (log N / 2) |B| - LL(B|D), where:
  - B is a Bayes net.
  - D is a training set.
  - N is the number of training examples.
  - |B| is the number of parameters in B.
  - LL is the log-likelihood.
- The goal is to find the network with minimum MDL.

MDL (cont.)
- The first term represents the size of the network, and forces the network to be small (avoiding overfitting).
- The second term measures how closely B models the probability distribution in the data D.
- Their combination balances a good fit to the data against keeping few enough parameters for good generalization.

Search
- Search is based on a greedy strategy: start with the empty network, and successively apply the local operation that most improves the network's MDL score, until a local minimum is reached.
- The local operations are: arc addition, arc deletion, and arc reversal.
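The score guiding this search can be written down directly; a minimal sketch, in which the parameter counts and log-likelihoods are invented for illustration:

```python
from math import log

def mdl(num_params, log_likelihood, num_examples):
    """MDL(B|D) = (log N / 2) * |B| - LL(B|D); lower is better."""
    return 0.5 * log(num_examples) * num_params - log_likelihood

# Hypothetical comparison on N = 1000 examples: the denser network fits
# slightly better (higher log-likelihood) but pays a larger size penalty.
sparse_net = mdl(num_params=51, log_likelihood=-1200.0, num_examples=1000)
dense_net = mdl(num_params=162, log_likelihood=-1190.0, num_examples=1000)
print(sparse_net < dense_net)  # True: greedy search would keep the sparse net
```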
Bayes Nets as Classifiers
- Given a training set <A1, ..., An, C>, one can use the learning technique just described to learn a Bayes net that does not rely on independence assumptions the way naïve Bayes does.
- However, in cases where there are many attributes (e.g. over 30), this approach actually performs worse than naïve Bayes. Why?

Performance of Bayesian Network Classifiers
- The problem lies in the definition of the MDL score: a network can have a very good MDL score and yet perform poorly as a classifier.
- Roughly speaking, the network tries to fit all the data (including the attributes), rather than just estimating as accurately as possible P(C | A1, ..., An), which is what matters for classification.

Tree-Augmented Naïve Bayesian Classifiers
(Continuation of Friedman et al.)
Presentation by: Tiago Vaz Maia

Introduction
- Last time: general Bayesian networks often perform worse than naïve Bayes in classification problems with many attributes.
- Today: an extension of naïve Bayes that allows edges between attributes, is computationally tractable, and outperforms both naïve Bayes and general Bayes nets in classification problems.

Augmented Naïve Bayesian Networks
- The problem with general Bayes nets is that the structure of the classification problem is not being exploited. In classification, the class should be treated differently from the attributes.
- Idea: maintain the structure of naïve Bayes, but allow edges between the attributes.

Augmented Naïve Bayes Nets (cont.)
- E.g.: [Figure: class node C with an edge to each attribute A1, A2, ..., An, plus some edges between attributes.]

Augmented Naïve Bayes (cont.)
- Problem: any combination of edges between the attributes is possible, so this is computationally intractable.
- Solution: allow only a subset of the possible edge combinations; namely, the edges between the attributes must form a tree.
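Classification with such a model scores each class c by the factored probability P(c) * prod_i P(a_i | parent(a_i), c); a minimal sketch, in which the structure (one attribute edge, A1 -> A2) and all probability values are invented:

```python
def class_score(c, attrs, prior, cpds, attr_parent):
    """Score P(c) * prod_i P(a_i | a_parent(i), c) for observed attributes.
    An attribute with parent None contributes a plain naive-Bayes factor."""
    score = prior[c]
    for name, value in attrs.items():
        parent = attr_parent[name]
        parent_value = attrs[parent] if parent else None
        score *= cpds[name][(value, parent_value, c)]
    return score

# Invented model: class -> {A1, A2}, plus one attribute edge A1 -> A2.
prior = {0: 0.5, 1: 0.5}
attr_parent = {"A1": None, "A2": "A1"}
cpds = {"A1": {(True, None, 0): 0.9, (True, None, 1): 0.2},
        "A2": {(True, True, 0): 0.7, (True, True, 1): 0.4}}

observed = {"A1": True, "A2": True}
score_0 = class_score(0, observed, prior, cpds, attr_parent)  # 0.5 * 0.9 * 0.7
score_1 = class_score(1, observed, prior, cpds, attr_parent)  # 0.5 * 0.2 * 0.4
predicted = 0 if score_0 > score_1 else 1
```

With no attribute edges at all this reduces to naïve Bayes; the tree restriction discussed above simply limits each attribute to at most one attribute parent.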
Tree-Augmented Naïve Bayes Nets (TAN)
- E.g.: [Figure: an example TAN over the nodes Document Type, Professor, Publications, Course, Grade and Student.]

Tree-Augmented Naïve Bayes Nets (cont.)
- Obviously not ideal from the point of view of expressiveness.
- However, more expressive than naïve Bayes, and computationally tractable: there exists a polynomial-time algorithm for finding the optimal tree structure between the attributes.

Learning Tree Bayes Nets from Data
1. Compute the mutual information I(X;Y) between every pair of variables.
2. Build a complete undirected graph in which the vertices are the variables, and annotate the edge connecting X and Y with the weight I(X;Y).
3. Build a maximum-weight spanning tree.
4. Transform the resulting undirected tree into a directed one by choosing a root variable and setting the direction of all edges to be outward from it.

Learning TANs
- Exactly the same, with the following minor changes:
  - In step 1, instead of I(Ai;Aj) we use the conditional mutual information I(Ai;Aj | C).
  - At the end, we add the class node and the edges from that node to all attributes.

Experimental Results
- In a series of experiments with data sets from the U.C. Irvine repository, TANs often outperformed: naïve Bayes, general Bayes nets, and C4.5.
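Steps 1-4 of the tree-learning procedure can be sketched with the standard library alone. This is a rough illustration, not the authors' implementation; for TANs, replace I(X;Y) below with the conditional mutual information I(Ai;Aj | C) and add the class node afterwards.

```python
from collections import Counter
from itertools import combinations
from math import log

def mutual_information(xs, ys):
    """Empirical I(X;Y) from two parallel lists of observations."""
    n = len(xs)
    pxy, px, py = Counter(zip(xs, ys)), Counter(xs), Counter(ys)
    return sum((c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())

def chow_liu_tree(data):
    """Steps 1-4: weight every pair by I(X;Y), build a maximum-weight
    spanning tree (Kruskal-style), then direct edges away from a root."""
    names = list(data)
    edges = sorted(((mutual_information(data[a], data[b]), a, b)
                    for a, b in combinations(names, 2)), reverse=True)
    parent = {v: v for v in names}      # union-find forest
    def find(v):
        while parent[v] != v:
            v = parent[v]
        return v
    undirected = []
    for _, a, b in edges:               # greedily add heaviest safe edges
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb
            undirected.append((a, b))
    adj = {v: [] for v in names}
    for a, b in undirected:
        adj[a].append(b); adj[b].append(a)
    # Step 4: direct all edges outward from an arbitrary root.
    directed, frontier, seen = [], [names[0]], {names[0]}
    while frontier:
        u = frontier.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                directed.append((u, v))
                frontier.append(v)
    return directed

# Toy data: B copies A exactly, C is unrelated, so the tree must link A and B.
data = {"A": [0, 0, 1, 1, 0, 1], "B": [0, 0, 1, 1, 0, 1],
        "C": [0, 1, 0, 1, 1, 0]}
tree = chow_liu_tree(data)
```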