Text Classifier Induction: Naive Bayes Classifiers
ML for NLP
Lecturer: Kevin Koidl; Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
[email protected], [email protected]
2017
Defining a CSV function
• Inductive construction of a text categorization module consists of defining
a Categorization Status Value (CSV) function
• CSV for Ranking and Hard classifiers:
– Ranking classifiers: for each category ci ∈ C, define a function CSVi
with the following signature:
CSVi : D → [0, 1]    (1)
– Hard classifiers: one can either define CSVi as above and define
a threshold τi above which a document is said to belong to ci , or
constrain CSVi to range over {T, F } directly.
Category membership thresholds
• Hard classifier status value, CSVi^h : D → {T, F}, can then be defined as follows (a code sketch follows these bullets):
CSVi^h(d) = T if CSVi(d) ≥ τi, F otherwise.    (2)
• Thresholds can be determined analytically or experimentally.
• Analytically derived thresholds are typical of TC systems that output
probability estimates of membership of documents to categories
• τi is then determined by decision-theoretic measures (e.g. utility)
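As promised above, a minimal sketch of the hard decision in equation (2), assuming csv_functions maps each category to its ranking function CSVi and thresholds maps each category to τi (both names are illustrative, not from the notes):

def hard_classify(doc, csv_functions, thresholds):
    """Equation (2): assign doc to every category ci whose ranking
    score CSVi(doc) reaches the category threshold tau_i."""
    return {c: csv_functions[c](doc) >= thresholds[c] for c in csv_functions}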
Experimental thresholds
• CSV thresholding, or SCut: optimal thresholding on the confidence scores of category candidates:
– Vary τi on the validation set Tv and choose the value that maximises effectiveness (see the sketch after this list)
• Proportional thresholding: choose τi s.t. the generality measure gTr(ci) is closest to gTv(ci).
• RCut or fixed thresholding: stipulate that a fixed number of categories are
to be assigned to each document.
• See (Yang, 2001) for a survey of thresholding strategies.
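A minimal sketch of SCut threshold selection, assuming effectiveness is measured by F1 over validation pairs of (CSVi(d), true membership); scored_validation and candidates are illustrative names, not part of the notes:

def f1(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0

def scut_threshold(scored_validation, candidates):
    """Return the tau_i in candidates that maximises F1 on the validation set.
    scored_validation: list of (CSV_i(d), true_membership) pairs."""
    best_tau, best_f1 = None, -1.0
    for tau in candidates:
        tp = sum(1 for s, y in scored_validation if s >= tau and y)
        fp = sum(1 for s, y in scored_validation if s >= tau and not y)
        fn = sum(1 for s, y in scored_validation if s < tau and y)
        score = f1(tp, fp, fn)
        if score > best_f1:
            best_tau, best_f1 = tau, score
    return best_tau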
ML methods for learning CSV functions
• Symbolic, numeric and meta-classification methods.
• Numeric methods implement classification indirectly: the classification function
f̂ outputs a numerical score, and hard classification is obtained via thresholding
– probabilistic classifiers, regression methods, ...
• Symbolic methods usually implement hard classification directly
– e.g.: decision trees, decision rules, ...
• Meta-classification methods combine results from independent classifiers
– e.g.: classifier ensembles, committees, ...
Probabilistic classifiers
• The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|d⃗) = f̂(d, c) that an instance represented as d⃗ should be classified as c.
• Components of d⃗ are regarded as random variables Ti (1 ≤ i ≤ |T|)
• Need to estimate probabilities for all possible representations, i.e. P(c|T1, . . . , Tn).
• Too costly in practice: for the discrete case with m possible nominal values per variable, that is O(m^|T|)
• Independence assumptions help...
Notes on the notation:
• P(d⃗j) = probability that a randomly picked text has d⃗j as its representation
• P(ci) = probability that a randomly picked text belongs to ci
Conditional independence assumption
• Using Bayes’ rule we get
P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)    (3)
• Naïve Bayes classifiers assume T1, . . . , Tn are independent of each other given the target category (see the sketch after these bullets):
P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)    (4)
• maximum a posteriori hypothesis: choose c that maximises (3)
• maximum likelihood hypothesis: choose c that maximises P (d~j |c) (i.e. assume
all c’s are equally likely)
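Combining (3) and (4), and noting that P(d⃗j) does not depend on c, the MAP choice reduces to maximising P(c) ∏k P(tk|c). A minimal sketch, assuming priors and cond_prob hold parameter estimates obtained elsewhere (all names here are illustrative):

import math

def map_classify(doc_terms, priors, cond_prob, categories):
    """Choose the category maximising P(c) * prod_k P(tk|c), i.e. the MAP
    hypothesis under the independence assumption of equation (4).
    Log-probabilities are used to avoid numerical underflow."""
    best_c, best_score = None, float("-inf")
    for c in categories:
        score = math.log(priors[c])
        for t in doc_terms:
            # unseen (term, category) pairs get a tiny floor here;
            # smoothing (later slides) is the principled fix
            score += math.log(cond_prob.get((t, c), 1e-10))
        if score > best_score:
            best_c, best_score = c, score
    return best_c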
Variants of Naive Bayes classifiers
• multi-variate Bernoulli models, in which features are modelled as Boolean
random variables, and
• multinomial models where the variables represent count data (McCallum
and Nigam, 1998)
• Continuous models which use numeric data representation: attributes represented by continuous probability distributions
– using Gaussian distributions, the conditionals can be estimated as
P(Ti = t|c) = (1/(σ√(2π))) e^(−(t−µ)²/(2σ²))    (5)
– Non-parametric kernel density estimation has also been proposed
(John and Langley, 1995)
Some Uses of NB in NLP
• Information retrieval (Robertson and Jones, 1988)
• Text categorisation (see (Sebastiani, 2002) for a survey)
• Spam filters
• Word sense disambiguation (Gale et al., 1992)
CSV for multi-variate Bernoulli models
• Starting from the independence assumption
P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)
• and Bayes’ rule
P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)
• derive a monotonically increasing function of P(c|d⃗):
f̂(d, c) = Σ_{i=1}^{|T|} ti log [ P(ti|c)[1 − P(ti|c̄)] / (P(ti|c̄)[1 − P(ti|c)]) ]    (6)
• Need to estimate 2|T| parameters, rather than 2^|T|.
Consider the Boolean document representation described in previous notes, where d⃗j = ⟨t1j, t2j, . . .⟩ could be, for instance, ⟨for = 1, a = 1, typical = 1, summer = 1, winter = 0, . . .⟩. This is the type of document representation we have in mind for this multi-variate Bernoulli implementation of the Naive Bayes TC.
Estimating the parameters
• For each term ti ∈ T, compute:
– nc ← the number of d⃗ s.t. f(d⃗, c) = 1
– ni ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 1
P(ti|c) ← (ni + 1) / (nc + 2)    (7)
(the +1 and +2 in numerator and denominator are for smoothing; see the next slides)
– nc̄ ← the number of d⃗ s.t. f(d⃗, c) = 0
– nī ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 0
P(ti|c̄) ← (nī + 1) / (nc̄ + 2)    (8)
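Putting equations (7)–(8) and the Bernoulli score (6) together, a minimal sketch assuming documents are given as Boolean term vectors (dicts mapping term → 0/1) and labels encode f(d⃗, c) ∈ {0, 1}; the function and variable names are illustrative, not from the notes:

import math

def train_bernoulli_nb(docs, labels, vocabulary):
    """Estimate P(ti|c) and P(ti|c_bar) as in equations (7) and (8)."""
    n_c = sum(labels)                  # documents with f(d, c) = 1
    n_cbar = len(labels) - n_c         # documents with f(d, c) = 0
    p_pos, p_neg = {}, {}
    for t in vocabulary:
        n_i = sum(d.get(t, 0) for d, y in zip(docs, labels) if y == 1)
        n_ibar = sum(d.get(t, 0) for d, y in zip(docs, labels) if y == 0)
        p_pos[t] = (n_i + 1) / (n_c + 2)        # equation (7)
        p_neg[t] = (n_ibar + 1) / (n_cbar + 2)  # equation (8)
    return p_pos, p_neg

def bernoulli_csv(doc, p_pos, p_neg):
    """Log-odds score of equation (6), summed over terms present in doc."""
    score = 0.0
    for t, t_i in doc.items():
        if t_i and t in p_pos:
            score += math.log(p_pos[t] * (1 - p_neg[t]) /
                              (p_neg[t] * (1 - p_pos[t])))
    return score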
An Alternative: multinomial models
• An alternative implementation of the Naïve Bayes Classifier is described in (Mitchell, 1997).
• In this approach, words appear as values rather than names of attributes
• A document representation for this slide would look like this:
d⃗ = ⟨a1 = ”an”, a2 = ”alternative”, a3 = ”implementation”, . . .⟩
• Problem: each attribute’s value would range over the entire vocabulary. Many
values would be missing for a typical document.
Dealing with missing values
• What if none of the training instances with target category cj have attribute value ai? Then P̂(ai|cj) = 0, and
P̂(cj) ∏_i P̂(ai|cj) = 0
• What to do?
• Smoothing: make a Bayesian estimate for P̂(ai|cj) (see the sketch after this list):
P̂(ai|cj) ← (nc + m·p) / (n + m)
where:
– n is the number of training examples for which C = cj
– nc is the number of examples for which C = cj and A = ai
– p is the prior estimate for P̂(ai|cj)
– m is weight given to prior (i.e. number of “virtual” examples)
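A minimal sketch of this m-estimate, with argument names taken from the list above:

def m_estimate(n_c, n, p, m):
    """Bayesian (m-estimate) smoothing for P(ai|cj):
    n   -- number of training examples for which C = cj
    n_c -- number of those examples for which A = ai
    p   -- prior estimate for P(ai|cj)
    m   -- weight given to the prior, in 'virtual' examples"""
    return (n_c + m * p) / (n + m)

Note that with a uniform prior p = 1/|T| and m = |T| this reduces to the Laplace-style estimate (nk + 1)/(n + |T|) used in the learning algorithm below.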
Learning in multinomial models
NB_Learn(Tr, C)
  /* collect all tokens that occur in Tr */
  T ← all distinct words and other tokens in Tr
  /* calculate P(cj) and P(tk|cj) */
  for each target value cj in C do
    Trj ← subset of Tr for which the target value is cj
    P(cj) ← |Trj| / |Tr|
    Textj ← concatenation of all texts in Trj
    n ← total number of tokens in Textj
    for each word tk in T do
      nk ← number of times word tk occurs in Textj
      P(tk|cj) ← (nk + 1) / (n + |T|)
    done
  done
Note an additional assumption: position is irrelevant, i.e.:
P (ai = tk |cj ) = P (am = tk |cj ) ∀i, m
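A minimal Python rendering of the NB_Learn pseudocode above; tokenisation by whitespace splitting is an assumption for the sketch, not part of the notes:

from collections import Counter

def nb_learn(training_docs):
    """training_docs: list of (text, category) pairs.
    Returns priors P(cj), smoothed conditionals P(tk|cj) and the vocabulary T."""
    vocabulary = {tok for text, _ in training_docs for tok in text.split()}
    categories = {c for _, c in training_docs}
    priors, cond = {}, {}
    for c in categories:
        texts_c = [text for text, cat in training_docs if cat == c]
        priors[c] = len(texts_c) / len(training_docs)
        tokens_c = [tok for text in texts_c for tok in text.split()]
        counts = Counter(tokens_c)
        n = len(tokens_c)
        for t in vocabulary:
            # Laplace-smoothed estimate (nk + 1) / (n + |T|)
            cond[(t, c)] = (counts[t] + 1) / (n + len(vocabulary))
    return priors, cond, vocabulary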
Sample Classification Algorithm
• Could calculate posterior probabilities for soft classification:
f̂(d) = P(c) ∏_{k=1}^{n} P(tk|c)
(where n is the number of tokens in d that occur in T) and use thresholding as before
• Or, for SLTC, implement hard categorisation directly:
positions ← all word positions in d that contain tokens found in T
Return cNB, where
cNB = arg max_{ci∈C} P(ci) ∏_{k∈positions} P(tk|ci)
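A matching classification sketch, using the priors, cond and vocabulary returned by the nb_learn sketch above (again, the names are illustrative); log-probabilities replace the product to avoid underflow:

import math

def nb_classify(doc_text, priors, cond, vocabulary):
    """Hard categorisation: arg max over ci of P(ci) * prod P(tk|ci),
    taken over the document tokens that occur in T."""
    tokens = [t for t in doc_text.split() if t in vocabulary]
    best_c, best_score = None, float("-inf")
    for c in priors:
        score = math.log(priors[c]) + sum(math.log(cond[(t, c)]) for t in tokens)
        if score > best_score:
            best_c, best_score = c, score
    return best_c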
Classification Performance
(Mitchell, 1997): Given 1000 training documents from each group, learn to
classify new documents according to which newsgroup they came from:
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
alt.atheism
soc.religion.christian
talk.religion.misc
talk.politics.mideast
talk.politics.misc
talk.politics.guns
sci.space
sci.crypt
sci.electronics
sci.med
Naive Bayes: 89% classification accuracy.
Learning performance
• Learning Curve for 20 Newsgroups:
NB: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we
will see later in the course. See (Joachims, 1996) for details.
NB and continuous variables
• Another model: suppose we want our document vectors to represent, say,
the TF-IDF scores of each term in the document:
d⃗ = ⟨a1 = tfidf(t1), . . . , an = tfidf(tn)⟩    (9)
• How would we estimate P(c|d⃗)?
• A: assuming an underlying (e.g. normal) distribution:
P(c|d⃗) ∝ ∏_{i=1}^{n} P(ai|c), where P(ai|c) = (1/(σc√(2π))) e^(−(x−µc)²/(2σc²))    (10)
µc and σc² are the mean and variance of the values taken by the attribute for positive instances (i.e. within class c).
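A minimal sketch of this Gaussian variant: per-attribute means and variances are estimated from the positive instances of a class and plugged into equation (10); the names are illustrative, not from the notes:

import math

def gaussian_pdf(x, mu, sigma):
    """Class-conditional density P(ai|c) of equation (10)."""
    return math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def fit_gaussian_params(vectors):
    """Per-attribute (mean, std) for one class.
    vectors: list of equal-length numeric feature vectors (e.g. TF-IDF scores)."""
    params = []
    for i in range(len(vectors[0])):
        values = [v[i] for v in vectors]
        mu = sum(values) / len(values)
        var = sum((x - mu) ** 2 for x in values) / len(values)
        params.append((mu, math.sqrt(var) or 1e-9))  # guard against zero variance
    return params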
Combining variables
• NB also allows you to combine different types of variables.
• The result would be a Bayesian Network with continuous and discrete
nodes. For instance:
[Diagram: a Bayesian network with class node C as the parent of attribute nodes a1, a2, . . . , ak, . . . , an]
• See (Luz, 2012; Luz and Su, 2010) for examples of the use of such combined
models in a different categorisation task.
Naive but subtle
• Conditional independence assumption is clearly false:
P(a1, a2, . . . , an|vj) = ∏_i P(ai|vj)
• ...but NB works well anyway. Why?
• posteriors P̂(vj|x) don’t need to be correct; we need only that:
arg max_{vj∈V} P̂(vj) ∏_i P̂(ai|vj) = arg max_{vj∈V} P(vj) P(a1, . . . , an|vj)
In other words, NB classification error is measured by a zero-one loss function, so classifications are often correct even if the estimated posteriors are unrealistically close to 1 or 0 (Domingos and Pazzani, 1996).
Performance can be optimal if dependencies are evenly distributed over classes,
or if they cancel each other out (Zhang, 2004).
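A tiny, made-up numeric illustration of this point (the probabilities below are invented for the example, not taken from the notes):

# The NB posterior estimate is badly calibrated, but the arg max, and hence
# the zero-one classification decision, is unchanged.
true_posterior = {"c1": 0.6, "c2": 0.4}    # hypothetical true P(vj|x)
nb_estimate = {"c1": 0.99, "c2": 0.01}     # overconfident NB estimate
assert max(true_posterior, key=true_posterior.get) == \
       max(nb_estimate, key=nb_estimate.get)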
Other Probabilistic Classifiers
• Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
– adopting weighted document vectors, rather than binary-valued ones
– introducing document length normalisation, in order to correct distortions in CSVi introduced by long documents
– relaxing the independence assumption (the least adopted variant,
since it appears that the binary independence assumption seldom
affects effectiveness)
– But see, for instance, Hidden Naive Bayes (Zhang et al., 2005)...
References
Domingos, P. and Pazzani, M. J. (1996). Beyond independence: Conditions for
the optimality of the simple Bayesian classifier. In International Conference
on Machine Learning, pages 105–112.
Gale, W., Church, K., and Yarowsky, D. (1992). A method for disambiguating
word senses in a large corpus. Computers and the Humanities, 26:415–439.
Joachims, T. (1996). A probabilistic analysis of the Rocchio algorithm with
TFIDF for text categorization. Technical Report CMU-CS-96-118, CMU.
John, G. H. and Langley, P. (1995). Estimating continuous distributions in
Bayesian classifiers. In Besnard, P. and Hanks, S., editors, Proceedings
of the 11th Conference on Uncertainty in Artificial Intelligence (UAI’95),
pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.
Luz, S. (2012). The non-verbal structure of patient case discussions in multidisciplinary medical team meetings. ACM Transactions on Information Systems,
30(3):17:1–17:24.
Luz, S. and Su, J. (2010). Assessing the effectiveness of conversational features
for dialogue segmentation in medical team meetings and in the AMI corpus.
In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational Linguistics.
McCallum, A. and Nigam, K. (1998). A comparison of event models for naive
Bayes text classification. In AAAI/ICML-98 Workshop on Learning for Text
Categorization, pages 41–48. AAAI Press.
Mitchell, T. M. (1997). Machine Learning. McGraw-Hill.
Robertson, S. E. and Jones, K. S. (1988). Relevance weighting of search terms.
In Document retrieval systems, pages 143–160. Taylor Graham Publishing,
London.
Sebastiani, F. (2002). Machine learning in automated text categorization. ACM
Computing Surveys, 34(1):1–47.
Yang, Y. (2001). A study on thresholding strategies for text categorization. In
Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings
of the 24th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval (SIGIR-01), pages 137–145, New York.
ACM Press.
Zhang, H. (2004). The optimality of Naive Bayes. In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference. AAAI Press.
Zhang, H., Jiang, L., and Su, J. (2005). Hidden naive Bayes. In Proceedings
of the National Conference on Artificial Intelligence, volume 20, page 919.
AAAI Press / MIT Press.