Text Classifier Induction: Naive Bayes Classifiers
ML for NLP
Lecturer: Kevin Koidl; Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
[email protected], [email protected]
2017
Defining a CSV function
▶ Inductive construction of a text categorization module consists of defining a Categorization Status Value (CSV) function.
▶ CSV for ranking and hard classifiers:
  • Ranking classifiers: for each category ci ∈ C, define a function CSVi with the following signature:

        CSVi : D → [0, 1]    (1)

  • Hard classifiers: one can either define CSVi as above and define a threshold τi above which a document is said to belong to ci, or constrain CSVi to range over {T, F} directly.
Category membership thresholds
▶ The hard classifier status value, CSVi^h : D → {T, F}, can then be defined as follows:

        CSVi^h(d) = T if CSVi(d) ≥ τi, F otherwise.    (2)

▶ Thresholds can be determined analytically or experimentally.
▶ Analytically derived thresholds are typical of TC systems that output probability estimates of the membership of documents in categories.
▶ τi is then determined by decision-theoretic measures (e.g. utility).
Experimental thresholds
▶ CSV thresholding (SCut): optimal thresholding on the confidence scores of category candidates: vary τi on the validation set Tv and choose the value that maximises effectiveness (see the sketch after this list).
▶ Proportional thresholding: choose τi such that the generality measure gTr(ci) is closest to gTv(ci).
▶ RCut, or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
▶ See [Yang, 2001] for a survey of thresholding strategies.
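A minimal sketch of SCut-style thresholding, assuming we already have real-valued confidence scores CSVi(d) and binary gold labels for category ci on a validation set; the helper names and the use of F1 as the effectiveness measure are illustrative choices, not part of the original slides.

    import numpy as np

    def f1(gold, pred):
        # simple F1 for one category (any other effectiveness measure could be used)
        tp = np.sum((pred == 1) & (gold == 1))
        fp = np.sum((pred == 1) & (gold == 0))
        fn = np.sum((pred == 0) & (gold == 1))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    def scut_threshold(scores, gold):
        """Vary tau_i over the distinct validation scores and keep the value
        that maximises effectiveness for category c_i (SCut)."""
        scores, gold = np.asarray(scores, float), np.asarray(gold, int)
        best_tau, best_f1 = 0.5, -1.0
        for tau in np.unique(scores):
            pred = (scores >= tau).astype(int)   # hard CSV_i^h via thresholding, as in (2)
            eff = f1(gold, pred)
            if eff > best_f1:
                best_tau, best_f1 = tau, eff
        return best_tau, best_f1

    # toy usage: scores produced by a ranking classifier on Tv, gold membership in c_i
    tau_i, best_eff = scut_threshold([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])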
ML methods for learning CSV functions
▶ Symbolic, numeric and meta-classification methods.
▶ Numeric methods implement classification indirectly: the classification function f̂ outputs a numerical score, and hard classification is obtained via thresholding.
  • e.g.: probabilistic classifiers, regression methods, ...
▶ Symbolic methods usually implement hard classification directly.
  • e.g.: decision trees, decision rules, ...
▶ Meta-classification methods combine results from independent classifiers.
  • e.g.: classifier ensembles, committees, ...
Probabilistic classifiers
▶ The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|d⃗) = f̂(d, c) that an instance represented as d⃗ should be classified as c.
▶ The components of d⃗ are regarded as random variables Ti (1 ≤ i ≤ |T|).
▶ Need to estimate probabilities for all possible representations, i.e. P(c|T1, . . . , Tn).
▶ Too costly in practice: for the discrete case with m possible nominal values, that is O(m^|T|).
▶ Independence assumptions help...
Conditional independence assumption
▶ Using Bayes' rule we get

        P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)    (3)

▶ Naïve Bayes classifiers: assume T1, . . . , Tn are independent of each other given the target category:

        P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)    (4)

▶ Maximum a posteriori hypothesis: choose the c that maximises (3).
▶ Maximum likelihood hypothesis: choose the c that maximises P(d⃗j|c) (i.e. assume all c's are equally likely).
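A small numeric sketch of the MAP versus maximum-likelihood choice under factorisation (4); the priors and term conditionals below are made up for illustration, and absent terms are ignored for brevity.

    import numpy as np

    # illustrative (made-up) parameters: two categories, three binary terms
    log_prior = {"c1": np.log(0.7), "c2": np.log(0.3)}
    log_cond = {"c1": np.log([0.8, 0.1, 0.5]),   # log P(tk = 1 | c1)
                "c2": np.log([0.2, 0.6, 0.5])}   # log P(tk = 1 | c2)

    d = np.array([1, 1, 0])   # a document as a binary term vector

    def log_likelihood(c):
        # independence assumption (4), restricted to the terms present in d
        return float(log_cond[c][d == 1].sum())

    ml_choice = max(log_prior, key=log_likelihood)                               # maximum likelihood
    map_choice = max(log_prior, key=lambda c: log_prior[c] + log_likelihood(c))  # MAP, maximises (3)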
Variants of Naive Bayes classifiers
▶ Multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and
▶ multinomial models, where the variables represent count data [McCallum and Nigam, 1998].
▶ Continuous models, which use a numeric data representation: attributes are represented by continuous probability distributions.
  • Using Gaussian distributions, the conditionals can be estimated as

        P(Ti = t|c) = (1 / (σ√(2π))) e^(−(t−µ)² / (2σ²))    (5)

  • Non-parametric kernel density estimation has also been proposed [John and Langley, 1995].
Some Uses of NB in NLP
▶ Information retrieval [Robertson and Jones, 1988]
▶ Text categorisation (see [Sebastiani, 2002] for a survey)
▶ Spam filters
▶ Word sense disambiguation [Gale et al., 1992]
CSV for multi-variate Bernoulli models
▶ Starting from the independence assumption

        P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)

▶ and Bayes' rule

        P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)

▶ derive a monotonically increasing function of P(c|d⃗):

        f̂(d, c) = Σ_{i=1}^{|T|} ti log[ P(ti|c)(1 − P(ti|c̄)) / (P(ti|c̄)(1 − P(ti|c))) ]    (6)

▶ Need to estimate 2|T|, rather than 2^|T|, parameters.
Estimating the parameters
▶ For each term ti ∈ T, compute:
  • nc ← the number of d⃗ s.t. f(d⃗, c) = 1
  • ni ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 1

        P(ti|c) ← (ni + 1) / (nc + 2)    (7)

    (the added terms in numerator and denominator are for smoothing; see next slides)
  • nc̄ ← the number of d⃗ s.t. f(d⃗, c) = 0
  • nī ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 0

        P(ti|c̄) ← (nī + 1) / (nc̄ + 2)    (8)
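A compact sketch of the multi-variate Bernoulli estimates (7)-(8) and the log-odds CSV of equation (6), assuming a binary document-term matrix X and binary labels y marking membership in category c; the array-based interface is an illustrative choice, not part of the original slides.

    import numpy as np

    def bernoulli_nb_csv(X, y):
        """Estimate P(ti|c) and P(ti|c_bar) with the +1/+2 smoothing of (7)-(8)
        and return a scorer implementing the CSV of equation (6).
        X: (n_docs, |T|) binary term-occurrence matrix; y: binary membership in c."""
        X, y = np.asarray(X), np.asarray(y)
        n_c, n_cbar = y.sum(), (1 - y).sum()
        n_i = X[y == 1].sum(axis=0)            # docs in c containing ti
        n_ibar = X[y == 0].sum(axis=0)         # docs outside c containing ti
        p_c = (n_i + 1) / (n_c + 2)            # equation (7)
        p_cbar = (n_ibar + 1) / (n_cbar + 2)   # equation (8)
        weights = np.log(p_c * (1 - p_cbar)) - np.log(p_cbar * (1 - p_c))
        return lambda d: float(np.asarray(d) @ weights)   # equation (6): sum_i ti * wi

    # toy usage: four documents over three terms, the first two belonging to c
    score = bernoulli_nb_csv([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]], [1, 1, 0, 0])
    csv_value = score([1, 0, 0])   # threshold as in (2) for a hard decision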
An Alternative: multinomial models
▶ An alternative implementation of the Naïve Bayes classifier is described in [Mitchell, 1997].
▶ In this approach, words appear as values rather than names of attributes.
▶ A document representation for this slide would look like this:

        d⃗ = ⟨a1 = "an", a2 = "alternative", a3 = "implementation", . . .⟩

▶ Problem: each attribute's value would range over the entire vocabulary; many values would be missing for a typical document.
Dealing with missing values
▶ What if none of the training instances with target category cj have attribute value ai?
▶ Then P̂(ai|cj) = 0, and...

        P̂(cj) ∏_i P̂(ai|cj) = 0

▶ What to do?
▶ Smoothing: make a Bayesian estimate for P̂(ai|cj):

        P̂(ai|cj) ← (nc + mp) / (n + m)

  where:
  • n is the number of training examples for which C = cj,
  • nc is the number of examples for which C = cj and A = ai,
  • p is a prior estimate for P̂(ai|cj),
  • m is the weight given to the prior (i.e. the number of "virtual" examples).
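A tiny sketch of the m-estimate above; the uniform prior p = 1/|vocabulary| used in the example is a common (assumed) choice rather than something prescribed by the slides.

    def m_estimate(n_c, n, p, m):
        """Bayesian (m-estimate) smoothing: (n_c + m*p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # e.g. an attribute value never seen with class c_j (n_c = 0), 100 training
    # examples of c_j, uniform prior over a 50,000-word vocabulary, m = 1
    p_hat = m_estimate(n_c=0, n=100, p=1 / 50_000, m=1)   # small but non-zero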
Learning in multinomial models
    NB_Learn(Tr, C)
        /* collect all tokens that occur in Tr */
        T ← all distinct words and other tokens in Tr
        /* calculate P(cj) and P(tk|cj) */
        for each target value cj in C do
            Trj ← subset of Tr for which the target value is cj
            P(cj) ← |Trj| / |Tr|
            Textj ← concatenation of all texts in Trj
            n ← total number of tokens in Textj
            for each word tk in T do
                nk ← number of times word tk occurs in Textj
                P(tk|cj) ← (nk + 1) / (n + |T|)
            done
        done

Note an additional assumption: position is irrelevant, i.e.

        P(ai = tk|cj) = P(am = tk|cj)  for all i, m
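A runnable sketch of the learning procedure above, assuming tokenised training documents are given as (token-list, label) pairs; the data structures and function name are illustrative.

    import math
    from collections import Counter

    def nb_learn(train_docs):
        """train_docs: list of (tokens, label) pairs, tokens being a list of strings.
        Returns the vocabulary T, log P(cj) and log P(tk|cj) with add-one smoothing."""
        vocab = {t for tokens, _ in train_docs for t in tokens}               # T
        classes = {label for _, label in train_docs}
        log_prior, log_cond = {}, {}
        for c in classes:
            docs_c = [tokens for tokens, label in train_docs if label == c]   # Tr_j
            log_prior[c] = math.log(len(docs_c) / len(train_docs))            # P(cj)
            counts = Counter(t for tokens in docs_c for t in tokens)          # counts in Text_j
            n = sum(counts.values())                                          # tokens in Text_j
            log_cond[c] = {t: math.log((counts[t] + 1) / (n + len(vocab)))    # P(tk|cj)
                           for t in vocab}
        return vocab, log_prior, log_cond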
Sample Classification Algorithm
▶ We could calculate posterior probabilities for soft classification,

        f̂(d) = P(c) ∏_{k=1}^{n} P(tk|c)

  (where n is the number of tokens in d that occur in T), and use thresholding as before.
▶ Or, for single-label text categorisation (SLTC), implement hard categorisation directly:

        positions ← all word positions in d that contain tokens found in T
        return cNB, where
            cNB = argmax_{ci ∈ C} P(ci) ∏_{k ∈ positions} P(tk|ci)
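A sketch of the hard classification step, reusing the model returned by the nb_learn sketch above; sums of logs are used instead of the product to avoid numerical underflow.

    def nb_classify(tokens, vocab, log_prior, log_cond):
        """Return c_NB = argmax_c P(c) * prod_k P(tk|c), over tokens found in T."""
        positions = [t for t in tokens if t in vocab]
        return max(log_prior,
                   key=lambda c: log_prior[c] + sum(log_cond[c][t] for t in positions))

    # toy usage
    docs = [("buy cheap pills now".split(), "spam"),
            ("meeting agenda attached".split(), "ham")]
    model = nb_learn(docs)
    label = nb_classify("cheap meeting pills".split(), *model)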
Classification Performance
[Mitchell, 1997]: given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from:
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
alt.atheism
soc.religion.christian
talk.religion.misc
talk.politics.mideast
talk.politics.misc
talk.politics.guns
sci.space
sci.crypt
sci.electronics
sci.med
Naive Bayes: 89% classification accuracy.
Learning performance
▶ Learning curve for 20 Newsgroups (figure omitted from this transcript).
▶ NB: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we will see later in the course; see [Joachims, 1996] for details.
NB and continuous variables
▶ Another model: suppose we want our document vectors to represent, say, the TF-IDF scores of each term in the document:

        d⃗ = ⟨a1 = tfidf(t1), . . . , an = tfidf(tn)⟩    (9)

▶ How would we estimate P(c|d⃗)?
▶ A: by assuming an underlying (e.g. normal) distribution:

        P(c|d⃗) ∝ P(c) ∏_{i=1}^{n} P(ai|c),  with  P(ai = x|c) = (1 / (σc√(2π))) e^(−(x−µc)² / (2σc²))    (10)

  where µc and σc² are the mean and variance of the values taken by the attribute over the positive instances.
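A small sketch of the Gaussian conditional estimate in (5)/(10), assuming we are given the TF-IDF values of one attribute restricted to the positive training instances; the function name is illustrative.

    import math

    def gaussian_conditional(values_in_c):
        """Fit mu_c and sigma_c^2 from the attribute values observed for class c
        and return x -> P(ai = x | c) under a normal assumption."""
        n = len(values_in_c)
        mu = sum(values_in_c) / n
        var = sum((v - mu) ** 2 for v in values_in_c) / n   # ML variance estimate
        # (a fuller implementation would guard against zero variance)
        def density(x):
            return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return density

    # toy usage: tf-idf values of term ti in the positive training documents
    p_ai_given_c = gaussian_conditional([0.12, 0.30, 0.25, 0.18])
    likelihood = p_ai_given_c(0.2)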
Combining variables
▶ NB also allows you to combine different types of variables.
▶ The result would be a Bayesian network with continuous and discrete nodes: a class node C whose children are the attribute nodes a1, a2, . . . , ak, . . . , an (figure).
▶ See [Luz, 2012, Luz and Su, 2010] for examples of the use of such combined models in a different categorisation task.
Naive but subtle
▶ The conditional independence assumption is clearly false:

        P(a1, a2, . . . , an|vj) = ∏_i P(ai|vj)

▶ ...but NB works well anyway. Why?
▶ The posteriors P̂(vj|x) don't need to be correct; we need only that

        argmax_{vj ∈ V} P̂(vj) ∏_i P̂(ai|vj) = argmax_{vj ∈ V} P(vj) P(a1, . . . , an|vj)

  In other words, NB classification error is a zero-one loss, so the classification is often correct even if the posteriors are unrealistically close to 1 or 0 [Domingos and Pazzani, 1996].
  Performance can be optimal if dependencies are evenly distributed over classes, or if they cancel each other out [Zhang, 2004].
Other Probabilistic Classifiers
▶ Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
  • adopting weighted document vectors, rather than binary-valued ones;
  • introducing document length normalisation, in order to correct distortions in CSVi introduced by long documents;
  • relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness).
▶ But see, for instance, Hidden Naive Bayes [Zhang et al., 2005]...
References I
Domingos, P. and Pazzani, M. J. (1996).
Beyond independence: Conditions for the optimality of the simple Bayesian classifier.
In International Conference on Machine Learning, pages 105–112.
Gale, W., Church, K., and Yarowsky, D. (1992).
A method for disambiguating word senses in a large corpus.
Computers and the Humanities, 26:415–439.
Joachims, T. (1996).
A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization.
Technical Report CMU-CS-96-118, CMU.
John, G. H. and Langley, P. (1995).
Estimating continuous distributions in Bayesian classifiers.
In Besnard, P. and Hanks, S., editors, Proceedings of the 11th Conference on Uncertainty in Artificial
Intelligence (UAI’95), pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.
Luz, S. (2012).
The non-verbal structure of patient case discussions in multidisciplinary medical team meetings.
ACM Transactions on Information Systems, 30(3):17:1–17:24.
Luz, S. and Su, J. (2010).
Assessing the effectiveness of conversational features for dialogue segmentation in medical team meetings
and in the AMI corpus.
In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational
Linguistics.
References II
McCallum, A. and Nigam, K. (1998).
A comparison of event models for naive Bayes text classification.
In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Robertson, S. E. and Jones, K. S. (1988).
Relevance weighting of search terms.
In Document retrieval systems, pages 143–160. Taylor Graham Publishing, London.
Sebastiani, F. (2002).
Machine learning in automated text categorization.
ACM Computing Surveys, 34(1):1–47.
Yang, Y. (2001).
A study on thresholding strategies for text categorization.
In Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings of the 24th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-01),
pages 137–145, New York. ACM Press.
Zhang, H. (2004).
The optimality of Naive Bayes.
In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference. AAAI
Press.
References III
Zhang, H., Jiang, L., and Su, J. (2005).
Hidden naive Bayes.
In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 919. AAAI Press.