Text Classifier Induction: Naive Bayes Classifiers
ML for NLP
Lecturer: Kevin Koidl Assist. Lecturer Alfredo Maldonado
[email protected], [email protected]
Defining a CSV function
• Inductive construction of a text categorization module consists of defining
a Categorization Status Value (CSV) function
• CSV for Ranking and Hard classifiers:
– Ranking classifiers: for each category ci ∈ C, define a function CSVi
with the following signature:
CSVi : D → [0, 1]
– Hard classifiers: one can either define CSVi as above and define
a threshold τi above which a document is said to belong to ci , or
constrain CSVi to range over {T, F } directly.
Category membership thresholds
• Hard classifier status value, $CSV_i^h : D \to \{T, F\}$, can then be defined as
$$
CSV_i^h(d) = \begin{cases} T & \text{if } CSV_i(d) \geq \tau_i, \\ F & \text{otherwise.} \end{cases}
$$
• Thresholds can be determined analytically or experimentally.
• Analytically derived thresholds are typical of TC systems that output
probability estimates of membership of documents to categories
• τi is then determined by decision-theoretic measures (e.g. utility)
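As an illustration of the definitions above, here is a minimal Python sketch of a hard classifier built on top of a ranking score. The function names, the example scores and the cost values are all hypothetical. The utility-based threshold assumes that CSVi(d) is a well-calibrated probability and that only the costs of false positives and false negatives matter.

```python
def hard_classify(csv_score: float, tau: float) -> bool:
    """Return True (T) when the ranking score reaches the threshold, else False (F)."""
    return csv_score >= tau


def utility_threshold(cost_fp: float, cost_fn: float) -> float:
    """Analytical threshold under a simple cost model (an assumption, not from the notes).

    Assigning c_i has expected cost (1 - p) * cost_fp, rejecting it has expected
    cost p * cost_fn, so assignment is preferable when p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


# Hypothetical CSV_i scores for three documents and a single category c_i.
scores = {"d1": 0.91, "d2": 0.50, "d3": 0.12}
tau_i = utility_threshold(cost_fp=1.0, cost_fn=1.0)  # equal costs give 0.5
decisions = {doc: hard_classify(s, tau_i) for doc, s in scores.items()}
```

Raising the cost of a false negative relative to a false positive lowers the threshold, so more documents are assigned to the category.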
Experimental thresholds
• CSV thresholding or SCut (optimal thresholding on the confidence scores of category candidates):
– Vary τi on the validation set Tv and choose the value that maximises effectiveness
• Proportional thresholding: choose τi such that the generality of ci (the proportion of documents assigned to it) on the validation set, gTv(ci), is closest to its generality on the training set, gTr(ci).
• RCut or fixed thresholding: stipulate that a fixed number of categories are
to be assigned to each document.
• See (Yang, 2001) for a survey of thresholding strategies.
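The three experimental strategies above can be sketched as follows. This is only an illustration under stated assumptions: effectiveness is measured by F1 (the notes do not fix a measure), candidate thresholds are taken from the observed scores, and all data and function names are made up.

```python
def f1(tp, fp, fn):
    """F1 from contingency counts; defined as 0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def scut(scores, labels, candidates):
    """SCut: pick the candidate threshold that maximises F1 on validation data."""
    best_tau, best_f1 = None, -1.0
    for tau in candidates:
        assigned = [s >= tau for s in scores]
        tp = sum(a and y for a, y in zip(assigned, labels))
        fp = sum(a and not y for a, y in zip(assigned, labels))
        fn = sum((not a) and y for a, y in zip(assigned, labels))
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau, best_f1


def proportional_threshold(scores, train_generality, candidates):
    """Pick the threshold whose validation generality is closest to the training one."""
    n = len(scores)
    return min(candidates,
               key=lambda tau: abs(sum(s >= tau for s in scores) / n - train_generality))


def rcut(category_scores, k):
    """RCut: assign the k best-scoring categories to a document."""
    return sorted(category_scores, key=category_scores.get, reverse=True)[:k]


# Hypothetical validation scores and true memberships for one category.
val_scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
val_labels = [True, True, False, True, False, False]
candidates = sorted(set(val_scores))

tau_scut, best_f1 = scut(val_scores, val_labels, candidates)
tau_prop = proportional_threshold(val_scores, train_generality=0.5, candidates=candidates)
top2 = rcut({"sport": 0.7, "politics": 0.2, "tech": 0.6}, k=2)
```

Note that SCut and proportional thresholding set one threshold per category, whereas RCut works per document.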
ML methods for learning CSV functions
• Symbolic, numeric and meta-classification methods.
• Numeric methods implement classification indirectly: the classification function
fˆ outputs a numerical score, hard classification via thresholding
– probabilistic classifiers, regression methods, ...
• Symbolic methods usually implement hard classification directly
– e.g.: decision trees, decision rules, ...
• Meta-classification methods combine results from independent classifiers
– e.g.: classifier ensembles, committees, ...
Probabilistic classifiers
• The CSV() of probabilistic classifiers produces an estimate of the conditional probability $P(c|\vec{d}) = \hat{f}(d, c)$ that an instance represented as $\vec{d}$ should be classified as c.
• Components of $\vec{d}$ regarded as random variables $T_i$ ($1 \leq i \leq |T|$)
• Need to estimate probabilities for all possible representations, i.e. $P(c|T_1, \ldots, T_n)$, with $n = |T|$.
• Too costly in practice: for the discrete case with m possible nominal values per variable,
that is $O(m^{|T|})$
• Independence assumptions help...
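To get a feel for the cost, the following sketch compares the number of distinct representations a full model would have to cover with the number of conditionals needed per category once the variables are treated as independent. The function names are hypothetical, and the counts assume every variable takes one of m nominal values.

```python
def full_joint_size(m: int, num_terms: int) -> int:
    """Number of distinct document representations: m ** |T|."""
    return m ** num_terms


def independent_size(m: int, num_terms: int) -> int:
    """Free parameters per category under independence: (m - 1) per term."""
    return num_terms * (m - 1)


# Boolean features (m = 2) over a tiny vocabulary of 20 terms.
joint = full_joint_size(2, 20)
independent = independent_size(2, 20)
```

Even with only 20 Boolean terms the full model has over a million configurations, against 20 conditionals per category.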
Notes on the notation:
• $P(\vec{d}_j)$ = probability that a randomly picked text has $\vec{d}_j$ as its representation
• P (ci ) = probability that a randomly picked text belongs to ci
Conditional independence assumption
• Using Bayes’ rule we get
$$
P(c|\vec{d}_j) = \frac{P(c)\,P(\vec{d}_j|c)}{P(\vec{d}_j)}
$$
• Naïve Bayes classifiers: assume $T_1, \ldots, T_n$ are independent of each other given
the target category:
$$
P(\vec{d}|c) = \prod_{k=1}^{|T|} P(t_k|c)
$$
• maximum a posteriori hypothesis: choose c that maximises the posterior $P(c|\vec{d}_j)$
• maximum likelihood hypothesis: choose c that maximises $P(\vec{d}_j|c)$ (i.e. assume
all c’s are equally likely)
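The difference between the two hypotheses can be seen in a small sketch. The categories, priors and conditionals below are invented for illustration, and the document is represented simply by the terms it contains. Logs are used to avoid multiplying many small numbers.

```python
import math

# Hypothetical, already-estimated parameters.
priors = {"spam": 0.2, "ham": 0.8}
cond = {
    "spam": {"offer": 0.30, "meeting": 0.05},
    "ham": {"offer": 0.10, "meeting": 0.20},
}


def log_likelihood(terms, c):
    """log P(d|c) under the conditional independence assumption."""
    return sum(math.log(cond[c][t]) for t in terms)


def ml_class(terms):
    """Maximum likelihood hypothesis: ignores the priors."""
    return max(priors, key=lambda c: log_likelihood(terms, c))


def map_class(terms):
    """Maximum a posteriori hypothesis: weighs the likelihood by P(c)."""
    return max(priors, key=lambda c: math.log(priors[c]) + log_likelihood(terms, c))


doc = ["offer"]
ml_choice = ml_class(doc)    # compares 0.30 with 0.10
map_choice = map_class(doc)  # compares 0.2 * 0.30 with 0.8 * 0.10
```

Here the two hypotheses disagree because the prior for "ham" is large enough to outweigh the likelihood ratio.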
Variants of Naive Bayes classifiers
• multi-variate Bernoulli models, in which features are modelled as Boolean
random variables, and
• multinomial models where the variables represent count data (McCallum
and Nigam, 1998)
• Continuous models which use numeric data representation: attributes represented by continuous probability distributions
– using Gaussian distributions, the conditionals can be estimated as
$$
P(T_i = t|c) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(t-\mu)^2}{2\sigma^2}}
$$
– Non-parametric kernel density estimation has also been proposed
(John and Langley, 1995)
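The notes do not say how the parameters of the Gaussian conditional are obtained. One common choice, stated here as an assumption, is to use the maximum likelihood estimates computed over the $N_c$ training documents of category c, with $t_i(\vec{d})$ denoting the value of attribute $T_i$ in document $\vec{d}$:

```latex
\hat{\mu}_{i,c} = \frac{1}{N_c} \sum_{\vec{d} \in c} t_i(\vec{d}),
\qquad
\hat{\sigma}^2_{i,c} = \frac{1}{N_c} \sum_{\vec{d} \in c} \bigl(t_i(\vec{d}) - \hat{\mu}_{i,c}\bigr)^2
```

One pair $(\hat{\mu}_{i,c}, \hat{\sigma}^2_{i,c})$ is then needed for each attribute and each category.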
Some Uses of NB in NLP
• Information retrieval (Robertson and Jones, 1988)
• Text categorisation (see (Sebastiani, 2002) for a survey)
• Spam filters
• Word sense disambiguation (Gale et al., 1992)
CSV for multi-variate Bernoulli models
• Starting from the independence assumption
$$
P(\vec{d}|c) = \prod_{k=1}^{|T|} P(t_k|c)
$$
• and Bayes’ rule
$$
P(c|\vec{d}_j) = \frac{P(c)\,P(\vec{d}_j|c)}{P(\vec{d}_j)}
$$
• derive a monotonically increasing function of $P(c|\vec{d})$:
$$
\hat{f}(d, c) = \sum_{i=1}^{|T|} t_i \log \frac{P(t_i|c)\,[1 - P(t_i|\bar{c})]}{P(t_i|\bar{c})\,[1 - P(t_i|c)]}
$$
• Need to estimate $2|T|$, rather than $2^{|T|}$, parameters.
Consider the Boolean document representation described in previous notes,
where a $\vec{d}_j = \langle t_{1j}, t_{2j}, \ldots \rangle$ could be, for instance, ⟨for = 1, a = 1, typical = 1, summer = 1, winter = 0, ...⟩. This is the type of document representation
we have in mind for this multi-variate Bernoulli implementation of the Naive
Bayes TC.
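The CSV function for the multi-variate Bernoulli model can be sketched in a few lines of Python. The probabilities below are invented, and the dictionary-based document representation is just one convenient encoding of the Boolean vector.

```python
import math


def bernoulli_csv(doc, p_c, p_not_c):
    """Sum, over terms present in the document (t_i = 1), of
    log( P(t_i|c) [1 - P(t_i|not c)] / ( P(t_i|not c) [1 - P(t_i|c)] ) )."""
    score = 0.0
    for term, present in doc.items():
        if present:  # terms with t_i = 0 contribute nothing to the sum
            p, q = p_c[term], p_not_c[term]
            score += math.log(p * (1 - q) / (q * (1 - p)))
    return score


# Hypothetical conditionals P(t_i|c) and P(t_i|not c) for two terms.
p_c = {"summer": 0.6, "winter": 0.1}
p_not_c = {"summer": 0.2, "winter": 0.3}

doc = {"summer": 1, "winter": 0}
score = bernoulli_csv(doc, p_c, p_not_c)  # log(0.48 / 0.08)
```

A positive score means the terms present are, on balance, more typical of c than of its complement; a threshold can then be applied as described earlier.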
Estimating the parameters
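• For each term ti ∈ T , make:
– $n_c$ ← the number of $\vec{d}$ s.t. $f(\vec{d}, c) = 1$
– $n_i$ ← the number of $\vec{d}$ for which $t_i = 1$ and $f(\vec{d}, c) = 1$
$$
P(t_i|c) \leftarrow \frac{n_i + 1}{n_c + 2}
$$
∗ (the constants added to numerator and denominator are for smoothing; see "Dealing with missing values" below)
– $n_{\bar{c}}$ ← the number of $\vec{d}$ s.t. $f(\vec{d}, c) = 0$
– $n_{\bar{i}}$ ← the number of $\vec{d}$ for which $t_i = 1$ and $f(\vec{d}, c) = 0$
$$
P(t_i|\bar{c}) \leftarrow \frac{n_{\bar{i}} + 1}{n_{\bar{c}} + 2}
$$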
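The estimation procedure above translates almost directly into code. In this sketch documents are represented as sets of the terms they contain (so ti = 1 means the term is in the set) and labels are Booleans standing for f(d, c). The toy corpus is invented.

```python
def estimate_bernoulli(docs, labels, vocabulary):
    """Smoothed estimates of P(t_i|c) and P(t_i|not c) from Boolean documents."""
    n_c = sum(labels)               # documents with f(d, c) = 1
    n_not_c = len(labels) - n_c     # documents with f(d, c) = 0
    p_c, p_not_c = {}, {}
    for t in vocabulary:
        n_i = sum(1 for d, y in zip(docs, labels) if y and t in d)
        n_i_bar = sum(1 for d, y in zip(docs, labels) if not y and t in d)
        p_c[t] = (n_i + 1) / (n_c + 2)
        p_not_c[t] = (n_i_bar + 1) / (n_not_c + 2)
    return p_c, p_not_c


# Toy training set: two documents in c, one outside it.
docs = [{"summer", "sun"}, {"summer"}, {"winter", "snow"}]
labels = [True, True, False]
vocabulary = {"summer", "sun", "winter", "snow"}

p_c, p_not_c = estimate_bernoulli(docs, labels, vocabulary)
```

Thanks to the added constants, a term never seen in c (such as "winter" here) still gets a non-zero estimate.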
An Alternative: multinomial models
• An alternative implementation of the Naı̈ve Bayes Classifier is described in
(Mitchell, 1997).
• In this approach, words appear as values rather than names of attributes
• A document representation for this slide would look like this:
$\vec{d} = \langle a_1 = \text{"an"},\ a_2 = \text{"alternative"},\ a_3 = \text{"implementation"}, \ldots \rangle$
• Problem: each attribute’s value would range over the entire vocabulary. Many
values would be missing for a typical document.
Dealing with missing values
• what if none of the training instances with target category $c_j$ have attribute value $a_i$? Then
$$
\hat{P}(a_i|c_j) = 0, \quad \text{and so} \quad \hat{P}(c_j) \prod_i \hat{P}(a_i|c_j) = 0
$$
• What to do?
• Smoothing: make Bayesian estimate for $\hat{P}(a_i|c_j)$
$$
\hat{P}(a_i|c_j) \leftarrow \frac{n_c + mp}{n + m}
$$
– n is number of training examples for which C = cj ,
– nc number of examples for which C = cj and A = ai
– p is prior estimate for P̂ (ai |cj )
– m is weight given to prior (i.e. number of “virtual” examples)
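A short sketch of this smoothed estimate, using the same symbols as the list above. The numbers are arbitrary. The last call illustrates that choosing a uniform prior p = 1/|V| with m = |V| virtual examples (|V| being a hypothetical vocabulary size) yields the familiar "add one" estimate.

```python
def m_estimate(n_c: int, n: int, p: float, m: float) -> float:
    """Smoothed estimate (n_c + m * p) / (n + m)."""
    return (n_c + m * p) / (n + m)


unseen = m_estimate(n_c=0, n=10, p=0.5, m=2)        # no longer zero
unsmoothed = m_estimate(n_c=3, n=10, p=0.5, m=0)    # m = 0 gives the raw ratio 3/10
add_one = m_estimate(n_c=3, n=10, p=1 / 4, m=4)     # equals (3 + 1) / (10 + 4)
```

The larger m is, the more the estimate is pulled towards the prior p and away from the observed frequency.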
Learning in multinomial models
NB_Learn(Tr, C)
  /* collect all tokens that occur in Tr */
  T ← all distinct words and other tokens in Tr
  /* calculate P(cj) and P(tk|cj) */
  for each target value cj in C do
    Trj ← subset of Tr for which target value is cj
    P(cj) ← |Trj| / |Tr|
    Textj ← concatenation of all texts in Trj
    n ← total number of tokens in Textj
    for each word tk in T do
      nk ← number of times word tk occurs in Textj
      P(tk|cj) ← (nk + 1) / (n + |T|)
Note an additional assumption: position is irrelevant, i.e.:
P (ai = tk |cj ) = P (am = tk |cj ) ∀i, m
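The learning procedure above could be written in Python roughly as follows. This is a sketch under simplifying assumptions: documents are already tokenised, the training set is a list of (tokens, category) pairs, and the toy data and names are made up.

```python
from collections import Counter


def nb_learn(training_set, categories):
    """Estimate P(c_j) and the smoothed P(t_k|c_j) from (tokens, category) pairs."""
    # Collect all distinct tokens that occur in the training set.
    vocabulary = {tok for tokens, _ in training_set for tok in tokens}
    priors, cond = {}, {}
    for c in categories:
        docs_c = [tokens for tokens, label in training_set if label == c]
        priors[c] = len(docs_c) / len(training_set)
        # Counting over all documents of c is equivalent to counting in Text_j.
        counts = Counter(tok for tokens in docs_c for tok in tokens)
        n = sum(counts.values())
        cond[c] = {t: (counts[t] + 1) / (n + len(vocabulary)) for t in vocabulary}
    return vocabulary, priors, cond


train = [
    ("cheap offer offer".split(), "spam"),
    ("meeting agenda".split(), "ham"),
    ("offer meeting".split(), "ham"),
]
vocabulary, priors, cond = nb_learn(train, ["spam", "ham"])
```

Because the same smoothed formula is applied to every token in the vocabulary, the conditionals for each category sum to one.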
Sample Classification Algorithm
• Could calculate posterior probabilities for soft classification
$$
\hat{f}(d) = P(c) \prod_{k=1}^{n} P(t_k|c)
$$
(where n is the number of tokens in d that occur in T) and use thresholding
as before
• Or, for SLTC (single-label text categorisation), implement hard categorisation directly:
  positions ← all word positions in d that contain tokens found in T
  Return $c_{nb}$, where
$$
c_{nb} = \arg\max_{c_i \in C} P(c_i) \prod_{k \in positions} P(t_k|c_i)
$$
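A possible implementation of the hard classification step is sketched below. The parameters are invented and assumed to have been estimated beforehand. Sums of logarithms replace the product, a standard trick (not mentioned in the notes) that avoids numerical underflow on long documents without changing the arg max.

```python
import math


def nb_classify(tokens, vocabulary, priors, cond):
    """Return the category maximising P(c) times the product of P(t_k|c)."""
    # Keep only the positions whose token is in the vocabulary.
    positions = [t for t in tokens if t in vocabulary]

    def log_score(c):
        return math.log(priors[c]) + sum(math.log(cond[c][t]) for t in positions)

    return max(priors, key=log_score)


# Hypothetical, already-estimated parameters.
vocabulary = {"offer", "meeting"}
priors = {"spam": 0.4, "ham": 0.6}
cond = {
    "spam": {"offer": 0.8, "meeting": 0.2},
    "ham": {"offer": 0.3, "meeting": 0.7},
}

label_a = nb_classify(["offer", "offer", "unknown"], vocabulary, priors, cond)
label_b = nb_classify(["meeting"], vocabulary, priors, cond)
```

Tokens outside the vocabulary (such as "unknown" here) are simply skipped, as in the definition of positions.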
Classification Performance
(Mitchell, 1997): Given 1000 training documents from each newsgroup, learn to
classify new documents according to which newsgroup they came from
Naive Bayes: 89% classification accuracy.
Learning performance
• Learning Curve for 20 Newsgroups:
Note: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we
will see later in the course. See (Joachims, 1996) for details.
NB and continuous variables
• Another model: suppose we want our document vectors to represent, say,
the TF-IDF scores of each term in the document:
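$\vec{d} = \langle a_1 = tfidf(t_1), \ldots, a_n = tfidf(t_n) \rangle$
• How would we estimate $P(c|\vec{d})$?
• A: assuming an underlying (e.g. normal) distribution:
$$
P(c|\vec{d}) \propto P(c) \prod_{i=1}^{n} P(a_i|c), \quad \text{where} \quad P(a_i = x|c) = \frac{1}{\sigma_c\sqrt{2\pi}}\, e^{-\frac{(x-\mu_c)^2}{2\sigma_c^2}}
$$
$\mu_c$ and $\sigma_c^2$ are the mean and variance of the values taken by the attribute for
positive instances of c.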
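A minimal Gaussian Naive Bayes classifier over TF-IDF vectors might look like the sketch below. The vectors and category names are invented, one mean and variance are estimated per attribute and per category, and a small variance floor `eps` is added as a practical safeguard (an assumption, not part of the notes) in case an attribute is constant within a category.

```python
import math
from statistics import mean, pvariance


def fit_gaussian_nb(vectors, labels):
    """For each category, store its prior and a (mean, variance) pair per attribute."""
    model = {}
    for c in set(labels):
        rows = [v for v, y in zip(vectors, labels) if y == c]
        prior = len(rows) / len(vectors)
        params = [(mean(col), pvariance(col)) for col in zip(*rows)]
        model[c] = (prior, params)
    return model


def log_posterior(x, prior, params, eps=1e-9):
    """log P(c) plus the sum of log Gaussian densities (up to the constant log P(d))."""
    total = math.log(prior)
    for value, (mu, var) in zip(x, params):
        var += eps  # variance floor to avoid division by zero
        total += -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
    return total


def predict(model, x):
    return max(model, key=lambda c: log_posterior(x, *model[c]))


# Hypothetical TF-IDF scores for two terms.
vectors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.7], [0.2, 0.9]]
labels = ["sport", "sport", "politics", "politics"]
model = fit_gaussian_nb(vectors, labels)

pred_a = predict(model, [0.85, 0.15])
pred_b = predict(model, [0.15, 0.80])
```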
Combining variables
• NB also allows you to combine different types of variables.
• The result would be a Bayesian Network with continuous and discrete
nodes. For instance:
• See (Luz, 2012; Luz and Su, 2010) for examples of the use of such combined
models in a different categorisation task.
Naive but subtle
• Conditional independence assumption is clearly false
$$
P(a_1, a_2, \ldots, a_n|v_j) = \prod_i P(a_i|v_j)
$$
• ...but NB works well anyway. Why?
• posteriors $\hat{P}(v_j|x)$ don’t need to be correct; we need only that:
$$
\arg\max_{v_j \in V} \hat{P}(v_j) \prod_i \hat{P}(a_i|v_j) = \arg\max_{v_j \in V} P(v_j)\, P(a_1, \ldots, a_n|v_j)
$$
In other words, NB classification error is measured by a zero-one loss, so the classifier is often correct even if the posteriors are unrealistically close to 1 or 0 (Domingos and Pazzani).
Performance can be optimal if dependencies are evenly distributed over classes,
or if they cancel each other out (Zhang, 2004).
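A small numerical illustration of this point, with invented numbers: suppose an attribute is copied three times, so the copies are perfectly dependent. The true likelihood is then just P(a|v), but NB multiplies the conditional three times. The posterior becomes overconfident, yet the arg max is unchanged.

```python
def posterior(priors, likelihoods):
    """Normalise P(v) * likelihood(v) over the classes."""
    joint = {v: priors[v] * likelihoods[v] for v in priors}
    z = sum(joint.values())
    return {v: joint[v] / z for v in joint}


priors = {"pos": 0.5, "neg": 0.5}
p_a = {"pos": 0.8, "neg": 0.4}  # hypothetical P(a = 1 | v)

# True posterior: the three copies carry the information of a single attribute.
true_post = posterior(priors, p_a)

# NB posterior: the copies are wrongly treated as independent evidence.
nb_post = posterior(priors, {v: p_a[v] ** 3 for v in p_a})

same_decision = max(true_post, key=true_post.get) == max(nb_post, key=nb_post.get)
```

The NB estimate for "pos" moves from about 0.67 to about 0.89, which is a poor probability estimate but the same classification decision.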
Other Probabilistic Classifiers
• Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
– adopting weighted document vectors, rather than binary-valued ones
– introducing document length normalisation, in order to correct distortions in CSVi introduced by long documents
– relaxing the independence assumption (the least adopted variant,
since it appears that the binary independence assumption seldom
affects effectiveness)
– But see, for instance Hidden Naive Bayes (Zhang et al., 2005)...
