Download Slide

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Learning from Multi-topic Web
Documents for Contextual
Advertisement
KDD 2008
Outline
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection
& Opinion Mining
4. Experiments
5. Conclusion
2017/5/22
2
1. Introduction (1/4)
• Contextual advertisement
– A popular advertising paradigm where web page owners
allow ad platforms to place ads on their pages that match
the content of their sites
– Problems:
• The huge variety of content that can appear on a single web page
– e.g. news sites, blogs, etc
• Advertisers do not want to show their ads on pages with content
like violence, pornography etc. (sensitive content)
• They may also not wish to advertise on pages which contain
negative opinion about their products
2017/5/22
3
1. Introduction (2/4)
• Objective
– Not only to tell if a document has some targeted
content in it, but also to label the parts of the
document where the content is present
• Sub-document classification
– Classifier train on entire pages using page-level
labels and test on individual blocks
2017/5/22
4
1. Introduction (3/4)
• Challenges:
– Pages can contain unwanted parts
• e.g., navigation panes, text advertisements, etc
– Pages may also contain information on multiple
topics
– To collect large amounts of broad coverage
single-topic training data, pre-clean and handlabel the blocks are difficult and expensive
2017/5/22
5
1. Introduction (4/4)
• This paper using multiple-instance learning (MIL)
techniques – MILBoost to improve the performance of
traditional methods (Naive-Bayes and Decision tree)
• To train sub-document classifiers using only page level labels
• The problems of sensitive content detection and
opinion/sentiment classification for advertising can be
considered as 2-class and multi-class classifying
• In sentiment detection, a Naive-Bayes based MILBoost
detector performs as well as the best block detector trained
with block-level labels
2017/5/22
6
Outline
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection
& Opinion Mining
4. Experiments
5. Conclusion
2017/5/22
7
2. Sensitive Content Detection (1/3)
• Sensitive content categories
– e.g., crime, war, disasters, terrorism, pornography, etc
• The various sensitive categories are grouped as one class
labeled as “sensitive”
• As long as a web page contains such content blocks, it will be
marked as sensitive and the ad display will be turned off
• The available training web pages are labeled at the page-level
– The labels only tell whether a page contains sensitive
content somewhere in it or not
2017/5/22
8
2. Sensitive Content Detection (2/3)
2017/5/22
9
2. Sensitive Content Detection (3/3)
• If using the entire page, traditional classification methods
run the risk of learning everything on the page as
“sensitive”
• To avoid this problem, a classifier that can accurately
identify the parts of the page that contain the targeted
content is needed
• Better still is a classifier that can integrate the two tasks of
locating and learning
– Multiple Instance Learning framework
2017/5/22
10
2.1. Multiple Instance Learning Boosting (1/8)
• Multiple Instance Learning (MIL) is a variation of
supervised learning where labels of training data are
incomplete
• Traditional methods where the label of each individual
training instance is known
• In MIL the labels are known only for groups of
instances
– Bag: a web page
– Instance: a block of text
2017/5/22
11
2.1. Multiple Instance Learning Boosting (2/8)
• 2-class classification (sensitive or non-sensitive)
– A bag is labeled positive if at least one instance in that bag
is positive
– A bag is labeled negative if all the instances in it are
negative
• The goal of MIL algorithm is to produce a content detector
at the sub-document (block) level without having the block
labels in the training data
• This can save significant amount of money and effort by
avoiding labeling work at the block level
2017/5/22
12
2.1. Multiple Instance Learning Boosting (3/8)
• Why MILBoost:
– The state of the art traditional algorithms use boosting
– We needed a framework to accurately measure the
added effectiveness of the MIL framework
– MILBoost has been successfully applied to a similar
problem
• Training a face detector to detect multiple faces in pictures
when only picture level labels are available
2017/5/22
13
2.1. Multiple Instance Learning Boosting (4/8)
Positive
Positive
?
Positive
?
Positive
Positive
?
Training initial
classifier
Negative
Negative
?
Negative
?
Negative
?
Positive
Positive
Negative
Positive
Negative
2017/5/22
Negative
Negative
14
2.1. Multiple Instance Learning Boosting (5/8)
• For each instance xij of bag
positive is given by
Bi , the probability of an instance xij is
1
Pij 
1  exp  yij 
where
yij  C ( xij )   t Ct xij 
t
is the weighted sum of the output of each classifier in ensemble with
t steps
th
C
(
x
)
t
• t ij is the output score of the instance xijgenerated by the
classifier of ensemble
2017/5/22
15
2.1. Multiple Instance Learning Boosting (6/8)
• The probability that the bag is positive is a “noisy OR”
Pi  1  P i  1   1  Pij 
ji
• Under this model the likelihood assigned to a set of
training bags is
L(C )   Pi li (1  Pi ) (1li )
i
where li  0,1 is the label of bag i
2017/5/22
16
2.1. Multiple Instance Learning Boosting (7/8)
• Following the AnyBoost approach, the weight on an instance
is given as
xij
 log L(C ) li  Pi
wij 

Pij
yij
Pi
• Each round of boosting is a search for a classifier Ct 1 which
maximum
 c( x
ij
)  wij
ij
where c ( xij ) is the score assigned to the instance xij of bag i
by the weak classifier (for a binary classifier c( xij )   1,1 )
• The parameter t 1is determined using a line search to
maximum log L(C   C )
t
2017/5/22
t
17
2.1. Multiple Instance Learning Boosting (8/8)
2017/5/22
18
2017/5/22
19
Outline
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection
& Opinion Mining
4. Experiments
5. Conclusion and Future Work
2017/5/22
20
3. Sentiment Classification and
Detection & Opinion Mining
• Sentiment/opinion mining from review pages or blogs
– A page may contain one or more topics
– It is common to label reviews as “positive” or “negative”
– Reviews are often not as polar or one sided as the label
indicates
– Blog review sites or discussion forums usually feature
many people expressing varied opinions about the same
product
– These “mixed” opinions may act as noise during the
training of traditional classification methods
2017/5/22
21
3.1 Multi-target MILBoost Algorithm (1/6)
• To apply MILBoost to the multi-topic detection task, it needs to be
extended to a multi-class scenario
• The “positive” and “negative” opinions can be treated as the target
classes and the “neutral” class as the null class
• A bag is labeled as belonging to class k if it contains at least one
instance of class k
• A bag can be multi-labeled since it may contain instances from more
than two different target classes
• To deal with multi-labels
– Creating duplicates of a bag with multiple labels
– Assigning a different label to each duplicate
2017/5/22
22
3.1 Multi-target MILBoost Algorithm (2/6)
• Suppose we have 1 . . .K target classes and class 0 is the null class
• For each instance xij of bag Bi , the probability that
(k  {1, 2, . . . ,K}) is given by a softMax function,
where
xijbelongs to class k
Yijk   t Ctk ( xij )
t
is the weighted sum of the output of each classifier in the ensemble with
t steps
•
Ctk ( xij ) is the output score for class k from instance xij generated by the t th
classifier of the ensemble
2017/5/22
23
3.1 Multi-target MILBoost Algorithm (3/6)
• The probability that a page has label k is the probability
that at least one of its content block has label k
• Assuming that the blocks are independent of each other,
the probability that a bag belongs to any target class k (k >
0) is
(“noisy OR” model)
• The probability that a page is neutral (or belongs to the null
class 0) is the same as the probability that all the blocks in
the page are neutral
2017/5/22
24
3.1 Multi-target MILBoost Algorithm (4/6)
• The log likelihood of all the training data can be given as
• The weight on each instance for next round of training is
given as
• For the null class,
2017/5/22
25
3.1 Multi-target MILBoost Algorithm (5/6)
• Combining weak classifiers
– Once the (t + 1)th classifier is trained, the weight on
the classifier t 1 can be obtained by a line search to
maximize the log likelihood function
• Choice of classifier C t
– In experiments, Naive Bayes and decision trees are
used
t 1
2017/5/22
26
3.1 Multi-target MILBoost Algorithm (6/6)
• Testing
– The new page is divided into blocks and the block level
probabilities are computed using the classifier
– The page level probabilities are obtained by combining
the block level probabilities using noisy-OR
– The block and page level labels are calculated using
thresholds on the probabilities
2017/5/22
27
Outline
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection
& Opinion Mining
4. Experiments
5. Conclusion
2017/5/22
28
4.1 Sensitive Content Detection (1/5)
• The data set contains 2,000 web pages which are labeled at the page level
by human annotators
• The label for each web page is binary, either sensitive or non-sensitive
• There is no labeling done at the text block level
• The evaluation has to be done at the web page level
• Two popular base classifiers were used to build the MILBoost ensemble,
decision trees and Naive Bayes
• Both the MILBoost and the non-MILboost versions were run through 30
boosting iterations which end up with an ensemble of 30 classifiers
• Area Under the ROC Curve (AUC) was used to evaluate the effectiveness
of the various detectors
2017/5/22
29
4.1 Sensitive Content Detection (2/5)
Significantly
outperforms
both basealmost
classifiers
and traditional
boosted
The MILBoost
version achieved
the same
performance
as theversions
MILBoost further improves this performance by another 8.2%
boosted page-classifier
2017/5/22
30
4.1 Sensitive Content Detection (3/5)
Althoug the AUC is about the same, the
MILBoosted system is almost
consistently better than the boosted pageclassifier at the early part, where usually
the operation point exists. This “early lift”
brings practical advantage to the
MILBoosted system.
2017/5/22
31
4.1 Sensitive Content Detection (4/5)
• Naive Bayes vs Decision Trees
– Naive Bayes performed much better than decision trees
in this task
– The decision tree ensemble uses only about 700
keywords while NB theoretically uses the whole
vocabulary, which is about 20,000
– The bigger feature set enables NB to generalize better
at the testing stage
2017/5/22
32
4.1 Sensitive Content Detection (5/5)
• A Sensitive Content Detection Demo
2017/5/22
33
4.2.1 Sentence Level Sentiment Detection (1/2)
• The subjectivity dataset from the Cornell movie review data
repository is used as the data set
• 10000 “objective” and “subjective” sentences are labeled
• These sentences were extracted from 3000 reviews, which are
labeled at the review-level as well
• A review is a “page” and a sentence is a “block”
• The MILBoost detector is trained with the review data only using
page-level labels, and then evaluated at the sentence-level with
sentence level labels
• Traditional page-level classifiers using boosted NB and decision
trees are also built as benchmark algorithms for comparison
• A page-level classifier using support vector machines (SVM) is
also trained to compare the performance
2017/5/22
34
4.2.1 Sentence Level Sentiment Detection (2/2)
Train
the
classifiers
theNB
reviews
usingpage-level
page-level
sentence-level
labels
labels
TheMILBoost
highest
inwith
all the
algorithms
and
it is trained
The
SVM
didperformance
not
do as well
as
classifiers
sentence
classification
improves
the
performance
byusing
aboutfor
10%
overlabels,
boosted
decision
trees
comparable
with
theor
best
sentence
detector trained
either
with the
pagewith
the sentence-level
label.with sentence-level labels
2017/5/22
35
4.2.2 Multi-class Sentiment Detection (1/3)
• Sentiment detection is naturally a three-class problem
with “positive”, “negative” and “neutral” as class labels
• The “positive” and “negative” classes are the target
classes and the “neutral” class is the null class in the
MILBoost setup
• In these tasks, only built a multi-class MIL system based
on Naives Bayes
• The evaluation can only be done at the page-level
2017/5/22
36
4.2.2 Multi-class Sentiment Detection (2/3)
MILBoost
The performance
based system
using SVM
improves
is comparable
upop the boosted
to the MILBoost
Naive Bayes
system
classifier
2017/5/22
37
4.2.2 Multi-class Sentiment Detection (3/3)
2017/5/22
38
4.2.3 When does MIL improve on traditional
methods? An Analysis Experiment (1/3)
• This paper hypothesized before that multiple-instance
learning should improve learning of traditional techniques
when the amount of mixed content is high
• The experiments were run on a car review dataset which
contained 113,000 user reviews from MSN Autos
• The objective of the learning task to identify negative
opinions in review texts
• These experiments want to show that as the amount of
mixed content increases, MIL based approach can help
traditional techniques improve
2017/5/22
39
4.2.3 When does MIL improve on traditional
methods? An Analysis Experiment (2/3)
• This data set had an overall review rating score from 0-10
• Assume that if the rating score is 6 or below, there will be some
negative opinions in the review text
• Further split the negative reviews into two subsets, one with rating
scores from 0 to 3 (data 0-3)and the other with ratings from 4 to 6
(data 4-6)
• Presumably, the percentage of negative sentences in “data 0-3” will
be much higher than that in “data 4-6”
• If hypothesis hold right, MIL based techniques should give a bigger
boost in “data 4-6”
2017/5/22
40
4.2.3 When does MIL improve on traditional
methods? An Analysis Experiment (3/3)
With
quality
training
data,
MILBoost
does
give
much
advantage
The
MILBoost
based
system
didin “data
Statistically
Theregood
are
three
times
as
many
pages
4-6”not
assignificant
in “data
0-3”
and the over
entire
notmethods;
improve
over
the
improvement
traditional
traditional
however,
if the
training
data has
aover
high
ratiotoofnegative
mixed ratio
class distribution
is much
highly
biased
towards
positive
with
positive
boosted system
classifiers
content,
then MILBoost
does provide significant
advantages
of 5:1 regular
2017/5/22
41
Outline
1. Introduction
2. Sensitive Content Detection
3. Sentiment Classification and Detection
& Opinion Mining
4. Experiments
5. Conclusion
2017/5/22
42
5. Conclusion
• This paper explored sub-document classification for
contextual advertisement applications where the desired
content appears only in a small part of a multi-topic web
document
• The sub-document classifiers are trained when only page
level labels are available
• It showed that the MILBoost system improve the
performance of the traditional classifiers in such tasks,
especially when the percentage of mixed content is high
• These systems provide good quality block level labels for
free, leading to significant savings in time and cost on
human labeling at the block level
2017/5/22
43
END
2017/5/22
44
AdaBoost
2017/5/22
45
Multi-labelled Document
2017/5/22
46