Text Classifier Induction: Naive Bayes Classifiers
ML for NLP
Lecturer: Kevin Koidl; Assistant Lecturer: Alfredo Maldonado
https://www.cs.tcd.ie/kevin.koidl/cs4062/
[email protected], [email protected]
2017
Defining a CSV function
▶ Inductive construction of a text categorization module consists of defining a Categorization Status Value (CSV) function.
▶ CSV for ranking and hard classifiers:
  • Ranking classifiers: for each category ci ∈ C, define a function CSVi with the following signature:

        CSVi : D → [0, 1]    (1)

  • Hard classifiers: one can either define CSVi as above and define a threshold τi above which a document is said to belong to ci, or constrain CSVi to range over {T, F} directly.
Category membership thresholds
▶ The hard classifier status value, CSVi^h : D → {T, F}, can then be defined as follows:

        CSVi^h(d) = T if CSVi(d) ≥ τi, F otherwise.    (2)

▶ Thresholds can be determined analytically or experimentally.
▶ Analytically derived thresholds are typical of TC systems that output probability estimates of the membership of documents in categories.
▶ τi is then determined by decision-theoretic measures (e.g. utility).
Experimental thresholds
▶ CSV thresholding (SCut): optimal thresholding on the confidence scores of category candidates: vary τi on the validation set Tv and choose the value that maximises effectiveness (see the sketch after this list).
▶ Proportional thresholding: choose τi such that the generality measure gTr(ci) is closest to gTv(ci).
▶ RCut, or fixed thresholding: stipulate that a fixed number of categories are to be assigned to each document.
▶ See [Yang, 2001] for a survey of thresholding strategies.
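A minimal sketch of SCut-style thresholding, assuming we already have real-valued confidence scores CSVi(d) and binary gold labels for category ci on a validation set; the helper names and the use of F1 as the effectiveness measure are illustrative choices, not part of the original slides.

    import numpy as np

    def f1(gold, pred):
        # simple F1 for one category (any other effectiveness measure could be used)
        tp = np.sum((pred == 1) & (gold == 1))
        fp = np.sum((pred == 1) & (gold == 0))
        fn = np.sum((pred == 0) & (gold == 1))
        denom = 2 * tp + fp + fn
        return 2 * tp / denom if denom else 0.0

    def scut_threshold(scores, gold):
        """Vary tau_i over the distinct validation scores and keep the value
        that maximises effectiveness for category c_i (SCut)."""
        scores, gold = np.asarray(scores, float), np.asarray(gold, int)
        best_tau, best_f1 = 0.5, -1.0
        for tau in np.unique(scores):
            pred = (scores >= tau).astype(int)   # hard CSV_i^h via thresholding, as in (2)
            eff = f1(gold, pred)
            if eff > best_f1:
                best_tau, best_f1 = tau, eff
        return best_tau, best_f1

    # toy usage: scores produced by a ranking classifier on Tv, gold membership in c_i
    tau_i, best_eff = scut_threshold([0.9, 0.8, 0.4, 0.2], [1, 1, 0, 0])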
ML methods for learning CSV functions
▶ Symbolic, numeric and meta-classification methods.
▶ Numeric methods implement classification indirectly: the classification function f̂ outputs a numerical score, and hard classification is obtained via thresholding.
  • e.g.: probabilistic classifiers, regression methods, ...
▶ Symbolic methods usually implement hard classification directly.
  • e.g.: decision trees, decision rules, ...
▶ Meta-classification methods combine results from independent classifiers.
  • e.g.: classifier ensembles, committees, ...
Probabilistic classifiers
▶ The CSV() of probabilistic classifiers produces an estimate of the conditional probability P(c|d⃗) = f̂(d, c) that an instance represented as d⃗ should be classified as c.
▶ The components of d⃗ are regarded as random variables Ti (1 ≤ i ≤ |T|).
▶ Need to estimate probabilities for all possible representations, i.e. P(c|T1, . . . , Tn).
▶ Too costly in practice: for the discrete case with m possible nominal values, that is O(m^|T|).
▶ Independence assumptions help...
Conditional independence assumption
▶ Using Bayes' rule we get

        P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)    (3)

▶ Naïve Bayes classifiers: assume T1, . . . , Tn are independent of each other given the target category:

        P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)    (4)

▶ Maximum a posteriori hypothesis: choose the c that maximises (3).
▶ Maximum likelihood hypothesis: choose the c that maximises P(d⃗j|c) (i.e. assume all c's are equally likely).
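A small numeric sketch of the MAP versus maximum-likelihood choice under factorisation (4); the priors and term conditionals below are made up for illustration, and absent terms are ignored for brevity.

    import numpy as np

    # illustrative (made-up) parameters: two categories, three binary terms
    log_prior = {"c1": np.log(0.7), "c2": np.log(0.3)}
    log_cond = {"c1": np.log([0.8, 0.1, 0.5]),   # log P(tk = 1 | c1)
                "c2": np.log([0.2, 0.6, 0.5])}   # log P(tk = 1 | c2)

    d = np.array([1, 1, 0])   # a document as a binary term vector

    def log_likelihood(c):
        # independence assumption (4), restricted to the terms present in d
        return float(log_cond[c][d == 1].sum())

    ml_choice = max(log_prior, key=log_likelihood)                               # maximum likelihood
    map_choice = max(log_prior, key=lambda c: log_prior[c] + log_likelihood(c))  # MAP, maximises (3)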
Variants of Naive Bayes classifiers
▶ Multi-variate Bernoulli models, in which features are modelled as Boolean random variables, and
▶ multinomial models, where the variables represent count data [McCallum and Nigam, 1998].
▶ Continuous models, which use a numeric data representation: attributes are represented by continuous probability distributions.
  • Using Gaussian distributions, the conditionals can be estimated as

        P(Ti = t|c) = (1 / (σ√(2π))) e^(−(t−µ)² / (2σ²))    (5)

  • Non-parametric kernel density estimation has also been proposed [John and Langley, 1995].
Some Uses of NB in NLP
▶ Information retrieval [Robertson and Jones, 1988]
▶ Text categorisation (see [Sebastiani, 2002] for a survey)
▶ Spam filters
▶ Word sense disambiguation [Gale et al., 1992]
CSV for multi-variate Bernoulli models
▶ Starting from the independence assumption

        P(d⃗|c) = ∏_{k=1}^{|T|} P(tk|c)

▶ and Bayes' rule

        P(c|d⃗j) = P(c) P(d⃗j|c) / P(d⃗j)

▶ derive a monotonically increasing function of P(c|d⃗):

        f̂(d, c) = Σ_{i=1}^{|T|} ti log[ P(ti|c)(1 − P(ti|c̄)) / (P(ti|c̄)(1 − P(ti|c))) ]    (6)

▶ Need to estimate 2|T|, rather than 2^|T|, parameters.
Estimating the parameters
▶ For each term ti ∈ T, compute:
  • nc ← the number of d⃗ s.t. f(d⃗, c) = 1
  • ni ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 1

        P(ti|c) ← (ni + 1) / (nc + 2)    (7)

    (the added terms in numerator and denominator are for smoothing; see next slides)
  • nc̄ ← the number of d⃗ s.t. f(d⃗, c) = 0
  • nī ← the number of d⃗ for which ti = 1 and f(d⃗, c) = 0

        P(ti|c̄) ← (nī + 1) / (nc̄ + 2)    (8)
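A compact sketch of the multi-variate Bernoulli estimates (7)-(8) and the log-odds CSV of equation (6), assuming a binary document-term matrix X and binary labels y marking membership in category c; the array-based interface is an illustrative choice, not part of the original slides.

    import numpy as np

    def bernoulli_nb_csv(X, y):
        """Estimate P(ti|c) and P(ti|c_bar) with the +1/+2 smoothing of (7)-(8)
        and return a scorer implementing the CSV of equation (6).
        X: (n_docs, |T|) binary term-occurrence matrix; y: binary membership in c."""
        X, y = np.asarray(X), np.asarray(y)
        n_c, n_cbar = y.sum(), (1 - y).sum()
        n_i = X[y == 1].sum(axis=0)            # docs in c containing ti
        n_ibar = X[y == 0].sum(axis=0)         # docs outside c containing ti
        p_c = (n_i + 1) / (n_c + 2)            # equation (7)
        p_cbar = (n_ibar + 1) / (n_cbar + 2)   # equation (8)
        weights = np.log(p_c * (1 - p_cbar)) - np.log(p_cbar * (1 - p_c))
        return lambda d: float(np.asarray(d) @ weights)   # equation (6): sum_i ti * wi

    # toy usage: four documents over three terms, the first two belonging to c
    score = bernoulli_nb_csv([[1, 0, 1], [1, 1, 0], [0, 1, 0], [0, 0, 1]], [1, 1, 0, 0])
    csv_value = score([1, 0, 0])   # threshold as in (2) for a hard decision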
An Alternative: multinomial models
▶ An alternative implementation of the Naïve Bayes classifier is described in [Mitchell, 1997].
▶ In this approach, words appear as values rather than names of attributes.
▶ A document representation for this slide would look like this:

        d⃗ = ⟨a1 = "an", a2 = "alternative", a3 = "implementation", . . .⟩

▶ Problem: each attribute's value would range over the entire vocabulary; many values would be missing for a typical document.
Dealing with missing values
▶ What if none of the training instances with target category cj have attribute value ai?
▶ Then P̂(ai|cj) = 0, and...

        P̂(cj) ∏_i P̂(ai|cj) = 0

▶ What to do?
▶ Smoothing: make a Bayesian estimate for P̂(ai|cj):

        P̂(ai|cj) ← (nc + mp) / (n + m)

  where:
  • n is the number of training examples for which C = cj,
  • nc is the number of examples for which C = cj and A = ai,
  • p is a prior estimate for P̂(ai|cj),
  • m is the weight given to the prior (i.e. the number of "virtual" examples).
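A tiny sketch of the m-estimate above; the uniform prior p = 1/|vocabulary| used in the example is a common (assumed) choice rather than something prescribed by the slides.

    def m_estimate(n_c, n, p, m):
        """Bayesian (m-estimate) smoothing: (n_c + m*p) / (n + m)."""
        return (n_c + m * p) / (n + m)

    # e.g. an attribute value never seen with class c_j (n_c = 0), 100 training
    # examples of c_j, uniform prior over a 50,000-word vocabulary, m = 1
    p_hat = m_estimate(n_c=0, n=100, p=1 / 50_000, m=1)   # small but non-zero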
Learning in multinomial models
    NB_Learn(Tr, C)
        /* collect all tokens that occur in Tr */
        T ← all distinct words and other tokens in Tr
        /* calculate P(cj) and P(tk|cj) */
        for each target value cj in C do
            Trj ← subset of Tr for which the target value is cj
            P(cj) ← |Trj| / |Tr|
            Textj ← concatenation of all texts in Trj
            n ← total number of tokens in Textj
            for each word tk in T do
                nk ← number of times word tk occurs in Textj
                P(tk|cj) ← (nk + 1) / (n + |T|)
            done
        done

Note an additional assumption: position is irrelevant, i.e.

        P(ai = tk|cj) = P(am = tk|cj)  for all i, m
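A runnable sketch of the learning procedure above, assuming tokenised training documents are given as (token-list, label) pairs; the data structures and function name are illustrative.

    import math
    from collections import Counter

    def nb_learn(train_docs):
        """train_docs: list of (tokens, label) pairs, tokens being a list of strings.
        Returns the vocabulary T, log P(cj) and log P(tk|cj) with add-one smoothing."""
        vocab = {t for tokens, _ in train_docs for t in tokens}               # T
        classes = {label for _, label in train_docs}
        log_prior, log_cond = {}, {}
        for c in classes:
            docs_c = [tokens for tokens, label in train_docs if label == c]   # Tr_j
            log_prior[c] = math.log(len(docs_c) / len(train_docs))            # P(cj)
            counts = Counter(t for tokens in docs_c for t in tokens)          # counts in Text_j
            n = sum(counts.values())                                          # tokens in Text_j
            log_cond[c] = {t: math.log((counts[t] + 1) / (n + len(vocab)))    # P(tk|cj)
                           for t in vocab}
        return vocab, log_prior, log_cond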
Sample Classification Algorithm
▶ We could calculate posterior probabilities for soft classification,

        f̂(d) = P(c) ∏_{k=1}^{n} P(tk|c)

  (where n is the number of tokens in d that occur in T), and use thresholding as before.
▶ Or, for single-label text categorisation (SLTC), implement hard categorisation directly:

        positions ← all word positions in d that contain tokens found in T
        return cNB, where
            cNB = argmax_{ci ∈ C} P(ci) ∏_{k ∈ positions} P(tk|ci)
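A sketch of the hard classification step, reusing the model returned by the nb_learn sketch above; sums of logs are used instead of the product to avoid numerical underflow.

    def nb_classify(tokens, vocab, log_prior, log_cond):
        """Return c_NB = argmax_c P(c) * prod_k P(tk|c), over tokens found in T."""
        positions = [t for t in tokens if t in vocab]
        return max(log_prior,
                   key=lambda c: log_prior[c] + sum(log_cond[c][t] for t in positions))

    # toy usage
    docs = [("buy cheap pills now".split(), "spam"),
            ("meeting agenda attached".split(), "ham")]
    model = nb_learn(docs)
    label = nb_classify("cheap meeting pills".split(), *model)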
Classification Performance
[Mitchell, 1997]: given 1000 training documents from each group, learn to classify new documents according to the newsgroup they came from:
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
alt.atheism
soc.religion.christian
talk.religion.misc
talk.politics.mideast
talk.politics.misc
talk.politics.guns
sci.space
sci.crypt
sci.electronics
sci.med
Naive Bayes: 89% classification accuracy.
Learning performance
▶ Learning curve for 20 Newsgroups (figure omitted from this transcript).
▶ NB: TFIDF and PRTFIDF are non-Bayesian probabilistic methods we will see later in the course; see [Joachims, 1996] for details.
NB and continuous variables
▶ Another model: suppose we want our document vectors to represent, say, the TF-IDF scores of each term in the document:

        d⃗ = ⟨a1 = tfidf(t1), . . . , an = tfidf(tn)⟩    (9)

▶ How would we estimate P(c|d⃗)?
▶ A: by assuming an underlying (e.g. normal) distribution:

        P(c|d⃗) ∝ P(c) ∏_{i=1}^{n} P(ai|c),  with  P(ai = x|c) = (1 / (σc√(2π))) e^(−(x−µc)² / (2σc²))    (10)

  where µc and σc² are the mean and variance of the values taken by the attribute over the positive instances.
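A small sketch of the Gaussian conditional estimate in (5)/(10), assuming we are given the TF-IDF values of one attribute restricted to the positive training instances; the function name is illustrative.

    import math

    def gaussian_conditional(values_in_c):
        """Fit mu_c and sigma_c^2 from the attribute values observed for class c
        and return x -> P(ai = x | c) under a normal assumption."""
        n = len(values_in_c)
        mu = sum(values_in_c) / n
        var = sum((v - mu) ** 2 for v in values_in_c) / n   # ML variance estimate
        # (a fuller implementation would guard against zero variance)
        def density(x):
            return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return density

    # toy usage: tf-idf values of term ti in the positive training documents
    p_ai_given_c = gaussian_conditional([0.12, 0.30, 0.25, 0.18])
    likelihood = p_ai_given_c(0.2)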
Combining variables
▶ NB also allows you to combine different types of variables.
▶ The result would be a Bayesian network with continuous and discrete nodes: a class node C whose children are the attribute nodes a1, a2, . . . , ak, . . . , an (figure).
▶ See [Luz, 2012, Luz and Su, 2010] for examples of the use of such combined models in a different categorisation task.
Naive but subtle
▶ The conditional independence assumption is clearly false:

        P(a1, a2, . . . , an|vj) = ∏_i P(ai|vj)

▶ ...but NB works well anyway. Why?
▶ The posteriors P̂(vj|x) don't need to be correct; we need only that

        argmax_{vj ∈ V} P̂(vj) ∏_i P̂(ai|vj) = argmax_{vj ∈ V} P(vj) P(a1, . . . , an|vj)

  In other words, NB classification error is a zero-one loss, so the classification is often correct even if the posteriors are unrealistically close to 1 or 0 [Domingos and Pazzani, 1996].
  Performance can be optimal if dependencies are evenly distributed over classes, or if they cancel each other out [Zhang, 2004].
Other Probabilistic Classifiers
▶ Alternative approaches to probabilistic classifiers attempt to improve effectiveness by:
  • adopting weighted document vectors, rather than binary-valued ones;
  • introducing document length normalisation, in order to correct distortions in CSVi introduced by long documents;
  • relaxing the independence assumption (the least adopted variant, since it appears that the binary independence assumption seldom affects effectiveness).
▶ But see, for instance, Hidden Naive Bayes [Zhang et al., 2005]...
References I
Domingos, P. and Pazzani, M. J. (1996).
Beyond independence: Conditions for the optimality of the simple Bayesian classifier.
In International Conference on Machine Learning, pages 105–112.
Gale, W., Church, K., and Yarowsky, D. (1992).
A method for disambiguating word senses in a large corpus.
Computers and the Humanities, 26:415–439.
Joachims, T. (1996).
A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization.
Technical Report CMU-CS-96-118, CMU.
John, G. H. and Langley, P. (1995).
Estimating continuous distributions in Bayesian classifiers.
In Besnard, P. and Hanks, S., editors, Proceedings of the 11th Conference on Uncertainty in Artificial
Intelligence (UAI’95), pages 338–345, San Francisco, CA, USA. Morgan Kaufmann Publishers.
Luz, S. (2012).
The non-verbal structure of patient case discussions in multidisciplinary medical team meetings.
ACM Transactions on Information Systems, 30(3):17:1–17:24.
Luz, S. and Su, J. (2010).
Assessing the effectiveness of conversational features for dialogue segmentation in medical team meetings
and in the AMI corpus.
In Proceedings of the SIGDIAL 2010 Conference, pages 332–339, Tokyo. Association for Computational
Linguistics.
References II
McCallum, A. and Nigam, K. (1998).
A comparison of event models for naive Bayes text classification.
In AAAI/ICML-98 Workshop on Learning for Text Categorization, pages 41–48. AAAI Press.
Mitchell, T. M. (1997).
Machine Learning.
McGraw-Hill.
Robertson, S. E. and Jones, K. S. (1988).
Relevance weighting of search terms.
In Document retrieval systems, pages 143–160. Taylor Graham Publishing, London.
Sebastiani, F. (2002).
Machine learning in automated text categorization.
ACM Computing Surveys, 34(1):1–47.
Yang, Y. (2001).
A study on thresholding strategies for text categorization.
In Croft, W. B., Harper, D. J., Kraft, D. H., and Zobel, J., editors, Proceedings of the 24th Annual
International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR-01),
pages 137–145, New York. ACM Press.
Zhang, H. (2004).
The optimality of Naive Bayes.
In Proceedings of the 17th International Florida Artificial Intelligence Research Society Conference. AAAI
Press.
References III
Zhang, H., Jiang, L., and Su, J. (2005).
Hidden naive Bayes.
In Proceedings of the National Conference on Artificial Intelligence, volume 20, page 919. AAAI Press.