Learning Classifiers from Imbalanced,
Only Positive and Unlabeled Data Sets
Yetian Chen
Department of Computer Science
Iowa State University
[email protected]
Abstract
In this report, I present my results on the two tasks of the 2008 UC San Diego Data Mining Contest. The contest consists of two classification tasks based on data from a scientific experiment. The first task is a binary classification task whose goal is to maximize classification accuracy on an evenly distributed test set, given a fully labeled but imbalanced training set. The second task is also a binary classification task, but the goal is to maximize the F1 score on a test set, given a partially labeled training set. For task 1, I investigated several re-sampling techniques for improving learning from imbalanced data: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. These techniques were used to create new, balanced training sets. Three standard classifiers (Decision Tree, Naïve Bayes, Neural Network) were then trained on the rebalanced training sets and used to classify the test set. The results show that the re-sampling techniques significantly improve the accuracy on the test set, except for the Naïve Bayes classifier. For task 2, I implemented a two-step strategy to learn a classifier from only positive and unlabeled data. In step 1, I used the Spy technique to extract reliable negative (RN) examples. In step 2, I used the labeled positive examples and the reliable negative examples as the training set to learn a standard Naïve Bayes classifier. The results show that the two-step algorithm significantly improves the F1 score compared to learning that simply treats all unlabeled examples as negative.
1. Introduction
The 2008 UC San Diego Data Mining Contest (http://mill.ucsd.edu/index.php?page=Main) consists of two tasks, both of which are binary classification tasks based on data from a scientific experiment. The first task is a standard classification task: maximize classification accuracy on a test set, given a fully labeled training set. It involves 20 real-valued features from an experiment in the physical sciences. The training data consist of 40,000 examples, but there are roughly ten times as many negative examples as positive ones. Thus, it is a typical class-imbalance problem. The
second task is a positive-only semi-supervised learning task, which aims to maximize the F1 score of classification on a test set, given a partially labeled training set. This is also a binary classification problem, but most of the training examples are unlabeled; in fact, only a few of the positive examples have labels. The unlabeled examples contain both positive and negative instances, with several times as many negatives as positives. As in the standard classification task, there are 20 real-valued features, but they are not the same features. The task is to classify the test-set examples as accurately as possible, which is evaluated using the F1 score. This setting is called PU-learning.
1.1 Learning from imbalanced data
The class imbalance problem is prevalent in many applications, including fraud/intrusion detection, risk management, text classification, and medical diagnosis/monitoring [7]. It typically occurs when, in a
classification problem, there are many more instances of
some classes than others. In such cases, standard classifiers
tend to be overwhelmed by the large classes and ignore the
small ones. Particularly, they tend to produce high
predictive accuracy over the majority class, but poor
predictive accuracy over the minority class. A number of
solutions to the class-imbalance problem were proposed
both at the data and algorithmic levels. At the data level,
these solutions include many different forms of re-sampling
such as over-sampling and under-sampling. These
techniques modify the prior probabilities of the majority and minority classes in the training set to obtain a more balanced number of instances in each class. The under-sampling method extracts a smaller set of majority instances while preserving all the minority instances. It is suitable for large-scale applications where the number of majority samples is tremendous, and reducing the number of training instances shortens the training time and makes the learning problem more tractable. In contrast, the over-sampling method increases the number of minority instances, for example by duplicating them or by generating synthetic ones.
At the algorithmic level, solutions include adjusting the misclassification costs of the various classes so as to counter the class imbalance during training, adjusting the decision threshold, and so on.
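As an illustration of the algorithmic-level approach, the following Python sketch (my own illustration, not code from the contest; it assumes numpy and a model that outputs positive-class probabilities p_pos on a validation set with labels 1/-1) picks a decision threshold that balances the per-class accuracies instead of using the default cut-off of 0.5:

import numpy as np

def choose_threshold(p_pos, y_true, grid=np.linspace(0.05, 0.95, 19)):
    # Pick the threshold maximizing balanced accuracy (the mean of the
    # per-class accuracies), which counters the bias toward the
    # majority class that a fixed 0.5 cut-off induces.
    best_t, best_score = 0.5, -1.0
    for t in grid:
        pred_pos = p_pos >= t
        acc_pos = np.mean(pred_pos[y_true == 1])    # minority-class accuracy
        acc_neg = np.mean(~pred_pos[y_true == -1])  # majority-class accuracy
        score = 0.5 * (acc_pos + acc_neg)
        if score > best_score:
            best_t, best_score = t, score
    return best_t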
1.2 Learning from only positive and unlabeled data

Consider a binary classification problem: given a set P that is an incomplete set of positive instances, and a set U of unlabeled instances containing both positive and negative instances, we want to build a classifier to classify the instances in U, or in a new test set, as positive or negative. This problem is called learning from only positive and unlabeled data, or PU-learning. PU-learning has many real-life applications. For example, there are over 1,000 specialized molecular biology databases, each of which defines a set of positive examples (genes/proteins related to a certain disease or function) but has no information about examples that should not be included (and it is unnatural to build such a set). Traditional classification techniques are inapplicable here, since they all require both labeled positive and labeled negative examples to build a classifier. Recently, a few algorithms were proposed to solve the problem [2][3][4]. One class of algorithms is based on a two-step strategy; these algorithms include S-EM, PEBL, and Roc-SVM.

Step 1: Identify a set of reliable negative examples (RN) from the unlabeled set. In this step, S-EM uses the Spy technique, PEBL uses a technique called 1-DNF, and Roc-SVM uses the Rocchio algorithm.

Step 2: Build a set of classifiers by iteratively applying a classification algorithm and then selecting a good classifier. In this step, S-EM uses the Expectation-Maximization (EM) algorithm with an NB classifier, while PEBL and Roc-SVM use SVM.

In this report, I implemented a two-step algorithm in which step 1 uses the Spy technique. After identifying a set of reliable negative examples (RN), I use P (the labeled positive examples) and RN to build a Naïve Bayes classifier. Section 3 gives the implementation details and the results.

2. Task 1: Learning Classifiers from Imbalanced Data Sets
In this section, I investigate techniques at the data level, i.e., re-sampling methods. I employed three re-sampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), over-sampling by duplicating minority examples, and random under-sampling. Using the rebalanced data, I trained three different classifiers, Decision Tree (C4.5), Naïve Bayes, and Neural Network (with one hidden layer), and used them to classify the test set.
2.1 Datasets
The training data set consists of 40,000 examples, each with 20 real-valued features. 3,636 of them are labeled as 1 (positive examples) and 36,364 are labeled as -1 (negative examples). There are no missing values in the data set. The test set consists of 10,000 examples, in which the two classes are evenly distributed.
More information about these two data sets can be found at [1]. In the experiments, all data sets were converted to .arff format for use with Weka.
2.2 Re-sampling Techniques
SMOTE
SMOTE (Synthetic Minority Over-sampling Technique) is
an over-sampling approach proposed in [8]. They generate synthetic examples in a less application-specific manner, by operating in “feature space” rather than “data space”. The minority class is over-sampled by taking
each minority class sample and introducing synthetic
examples along the line segments joining any/all of the k
minority class nearest neighbors. Depending upon the
amount of over-sampling required, neighbors from the k
nearest neighbors are randomly chosen. Their
implementation currently uses five nearest neighbors. For
instance, if the amount of over-sampling needed is 200%,
only two neighbors from the five nearest neighbors are
chosen and one sample is generated in the direction of each.
Synthetic samples are generated in the following way: take
the difference between the feature vector (sample) under
consideration and its nearest neighbor; multiply this
difference by a random number between 0 and 1, and add it
to the feature vector under consideration. This causes the
selection of a random point along the line segment between
two specific features (Fig 1). This approach effectively
forces the decision region of the minority class to become
more general.
Fig 1. Over-sampling with SMOTE. The minority class is over-sampled by taking each minority class sample and introducing synthetic examples (blue circles) along the line segments joining any/all of the k (default = 5) minority class nearest neighbors (red circles).
In the experiment, the positive examples were over-sampled by 900%, so that the size of the positive class is 36,360, roughly equal to the 36,364 examples of the negative class. The SMOTE technique used is the one shipped with the Weka package: weka.filters.supervised.instance.SMOTE.
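To make the interpolation step concrete, here is a minimal SMOTE-style sketch in Python (numpy only; this is my own illustration rather than the Weka filter used in the experiments, and the name smote_sample is hypothetical):

import numpy as np

def smote_sample(X_min, n_new, k=5, seed=0):
    # Generate n_new synthetic minority samples: pick a random
    # minority point, pick one of its k nearest minority neighbors,
    # and interpolate at a random position along the segment.
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)  # distances to all minority points
        nn = np.argsort(d)[1:k + 1]                   # k nearest neighbors, self excluded
        j = rng.choice(nn)
        gap = rng.random()                            # random number in [0, 1)
        synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.vstack(synthetic)

For the 900% over-sampling used here, one would call smote_sample(X_pos, 9 * len(X_pos)) and append the result to the training set.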
Over-sampling by duplicating the minority examples
To produce a contrast to SMOTE, I implemented a simple over-sampling approach that duplicates the minority examples. In this experiment, each positive example was duplicated 9 times to make the size of the positive class roughly equal to the size of the negative class.
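A sketch of this duplication scheme (numpy; the array names X_pos and X_neg are hypothetical):

import numpy as np

# Each positive example plus 9 duplicates = 10 copies in total,
# giving 10 x 3,636 = 36,360 positives vs. 36,364 negatives.
X_pos_dup = np.repeat(X_pos, repeats=10, axis=0)
X_bal = np.vstack([X_pos_dup, X_neg])
y_bal = np.concatenate([np.ones(len(X_pos_dup)), -np.ones(len(X_neg))])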
Random under-sampling
As mentioned previously, the under-sampling method extracts a smaller set of majority instances while preserving all the minority instances. In this experiment, I implemented an under-sampling approach that randomly selects a subset of examples from the majority class. For this data set, 3,720 negative examples were randomly selected from all 36,364 negative examples, and all 3,636 positive examples were preserved in the new training set.
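The corresponding sketch for random under-sampling (again with hypothetical array names):

import numpy as np

rng = np.random.default_rng(0)
# Keep all 3,636 positives; draw 3,720 negatives without replacement.
idx = rng.choice(len(X_neg), size=3720, replace=False)
X_bal = np.vstack([X_pos, X_neg[idx]])
y_bal = np.concatenate([np.ones(len(X_pos)), -np.ones(3720)])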
2.3 Building Standard Classifiers

Using the new training data sets, I then trained three different classifiers: Decision Tree, Naïve Bayes, and Neural Network (with one hidden layer).

Decision Tree

Decision Tree classifiers were trained using each of the three rebalanced training sets. I used the class weka.classifiers.trees.j48.J48 in the Weka package. When building the tree, I selected the default pruning option.

Naïve Bayes

Similarly, Naïve Bayes classifiers were trained on the three new training sets using the class weka.classifiers.bayes.NaiveBayes. 5-fold cross-validation was selected.

Neural Network

Three-layer feed-forward neural networks (one hidden layer) were trained using the new data sets. I experimented with different numbers of hidden units and selected the one with the best accuracy, using the default learning rate of 0.3 and momentum of 0.2. The training algorithm used is weka.classifiers.functions.neural.NeuralNetwork. I experimented with 5, 11, 15, and 20 hidden units; the accuracy reaches a plateau after 11 hidden units, so Table 1 reports only the accuracy of the 11-hidden-unit network.
2.4 Results

Table 1 summarizes the accuracies of the classifiers trained using the different training sets in classifying the test set.

Table 1. Effect of re-sampling techniques on the classification accuracy on the test set

        no resampling   US      OSbD    OS_SMOTE
DT      0.791           0.828   0.788   0.875
NB      0.834           0.827   0.827   0.838
NN      0.835           0.909   0.904   0.910

Notation: DT (Decision Tree), NB (Naïve Bayes), NN (Neural Network); no resampling (no re-sampling technique applied), US (random under-sampling), OSbD (over-sampling by duplication), OS_SMOTE (over-sampling with SMOTE). For NN, the number of hidden units is 11. The results in Table 1 are plotted in Fig 2.

Fig 2. Effect of the re-sampling techniques on the imbalanced data: test-set accuracy of DT, NB, and NN under each re-sampling method (plot of the values in Table 1).

For the Decision Tree classifier, random under-sampling and over-sampling with SMOTE significantly improve the accuracy. Over-sampling with SMOTE gives the best accuracy (0.875), an 8% improvement over the DT classifier trained directly on the imbalanced data.

For the Naïve Bayes classifier, none of the three re-sampling approaches significantly improves the predictive accuracy on the test set. The accuracies are all around 0.83, roughly the same as for the NB classifier trained on the original imbalanced data set.

For the Neural Network, all three re-sampling techniques significantly improve the predictive accuracy on the test set. The Neural Network with SMOTE over-sampling gives the best accuracy of all classifiers and re-sampling techniques. Thus, my best accuracy is 0.91, which ranked 52nd among all 199 teams; the best accuracy in this ranking is 0.928.
3. Task 2: Learning Classifiers from Only
Positive and Unlabeled Data Sets
3.1 Data Set
The training data set consists of 68,560 examples, each of which also has 20 real-valued features. Only 60 of them are labeled as 1 (positive examples); the rest are unlabeled. Again, there are no missing values in the data set. The test set consists of 11,427 examples.
More information about these two data sets can be found at [1]. In the experiments, all data sets were converted to .arff format for use with Weka.
3.2 Two-step Strategy
Theoretically, the PU-learning problem (learning from only positive and unlabeled data) is learnable [3][5]. A number of solutions have been proposed, among them a class of algorithms based on a two-step strategy. In step 1, these algorithms use various techniques to extract a set of reliable negative examples from the unlabeled examples. In step 2, different classifiers can then be trained using the reliable negative examples obtained in step 1. In my project, I employed the Spy technique [3] in step 1 to extract reliable negative examples. Then I built a Naïve Bayes classifier using the labeled positive examples and the reliable negative examples as the training set.
The two-step strategy is illustrated in Fig 3.
Fig 3. Illustration of the two-step strategy for PU-learning. Step 1 extracts a set of reliable negative examples (RN) from the unlabeled set U, which also leaves Q = U − RN. Step 2 either uses P, RN, and Q to build the final classifier iteratively, or uses only P and RN to build a classifier.
Step 1: The Spy technique to find reliable negative instances

The algorithm for the Spy technique is given in Fig 4 (in the pseudocode, M denotes the unlabeled set and N the resulting set of reliable negatives, i.e., RN). It first randomly selects a set S of positive examples from P and puts them into the unlabeled set (lines 2 and 3); the default value for s% is 15%. The examples in S act as “spies” sent from the positive set into the unlabeled set. The spies behave similarly to the unknown positive instances hidden in the unlabeled data, and hence allow the algorithm to infer the behavior of those unknown positives. It then runs the I-EM algorithm using the set P − S as positive and the set M ∪ S as negative (lines 3-7); I-EM basically runs NB twice (see the I-EM pseudocode below). After I-EM completes, the resulting classifier uses the probabilities assigned to the spies in S to decide a probability threshold t, which separates the likely negative examples in M (those with Pr[1|e] < t) from the rest (lines 9-13). See [3] for details.

Algorithm Step-1
1.  N = U = ∅
2.  S = sample(P, s%)
3.  MS = M ∪ S
4.  P = P − S
5.  Assign every example in P the class label 1
6.  Assign every example in MS the class label -1
7.  Run I-EM(MS, P)
8.  Classify each example in MS
9.  Determine the probability threshold t using S
10. for each example e in M
11.     if its probability Pr[1|e] < t
12.         N = N ∪ {e}
13.     else U = U ∪ {e}

I-EM(M, P)
1. Build an initial naïve Bayes classifier NB-C using P as the positive examples and M as the negative examples
2. Loop while the classifier parameters change:
3.     for each example e ∈ M
4.         compute Pr[1|e]
5.     update Pr[xi|1] and Pr[1] given the probabilistic labels Pr[1|e] and P

Fig 4. Algorithm for the Spy technique.
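A compact runnable sketch of this step is given below. It is my own illustration, not the implementation used in the report: scikit-learn's GaussianNB stands in for the discretized NB, a single NB round stands in for I-EM, and the function name spy_step and the noise quantile used for the threshold are assumptions.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def spy_step(X_p, X_u, spy_frac=0.15, noise=0.05, seed=0):
    # Step 1 of the two-step strategy with the Spy technique.
    rng = np.random.default_rng(seed)
    n_spy = max(1, int(spy_frac * len(X_p)))
    spy_idx = rng.choice(len(X_p), size=n_spy, replace=False)
    spies = X_p[spy_idx]
    P_rest = np.delete(X_p, spy_idx, axis=0)

    # Train NB treating P - S as positive and M + S as negative.
    X = np.vstack([P_rest, X_u, spies])
    y = np.concatenate([np.ones(len(P_rest)), np.zeros(len(X_u) + n_spy)])
    nb = GaussianNB().fit(X, y)

    # Threshold t: the `noise` quantile of the spies' positive-class
    # probabilities, so that almost all spies score above t.
    t = np.quantile(nb.predict_proba(spies)[:, 1], noise)

    p_u = nb.predict_proba(X_u)[:, 1]
    RN = X_u[p_u < t]    # reliable negatives
    Q = X_u[p_u >= t]    # still-ambiguous unlabeled examples
    return RN, Q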
Step 2: Building a standard Naïve Bayes classifier using P and RN

After step 1, we have obtained a set RN of examples that we believe are most likely negative. I then used the labeled positive examples (P) and RN to train a Naïve Bayes classifier. As in task 1, I used the class weka.classifiers.bayes.NaiveBayes in Weka, with 5-fold cross-validation. One thing to mention: since the 20 features are real-valued, I first discretized each feature i into 10 equal-width bins of width (xi,max − xi,min)/10, and then computed the posterior probabilities from the resulting sufficient statistics.
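The equal-width discretization mentioned above can be sketched as follows (numpy; my own illustration):

import numpy as np

def discretize(X, n_bins=10):
    # Map each real-valued feature to a bin index 0..n_bins-1 using
    # equal-width bins of size (x_max - x_min) / n_bins per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = (hi - lo) / n_bins
    bins = np.floor((X - lo) / np.where(width > 0, width, 1)).astype(int)
    return np.clip(bins, 0, n_bins - 1)   # put x == x_max into the top bin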
3.3 Results
I ran two experiments. In the first, I simply regarded all unlabeled examples (U) as negative examples, so the training set combines P and U, and a Naïve Bayes classifier was trained on it. The second experiment follows the two-step strategy above. Both classifiers were evaluated by their F1 score on the test set.
Table 2. F1 score of the PU-learning experiments

            P, N=U    P, N=RN
F1 score    0.545     0.651

Notation: F1 = 2 × Precision × Recall / (Precision + Recall).
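For concreteness, a tiny helper that computes the score from prediction counts (the example values are hypothetical, chosen only to show the arithmetic):

def f1_score(tp, fp, fn):
    # F1 = 2 * precision * recall / (precision + recall)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# e.g., tp=300, fp=200, fn=300 gives precision 0.6, recall 0.5,
# and F1 = 2 * 0.6 * 0.5 / 1.1 = 0.545 (rounded).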
When trained with P vs. U (treating U as the negative examples), the F1 score is 0.545. The two-step strategy gives F1 = 0.651, a significant improvement. The best score among all teams in the contest is 0.721, which means I still have a long way to go. Nevertheless, the two-step strategy clearly improves the predictive power.
4. Conclusion and Discussion
In this report, I studied two challenging data mining tasks from the 2008 UC San Diego Data Mining Contest. The first task is to improve learning from imbalanced data sets. For this problem, I investigated a set of re-sampling approaches, random under-sampling, over-sampling by duplicating the minority class, and SMOTE (Synthetic Minority Over-sampling Technique), and built three classifiers using the rebalanced data sets. The results show that, for Naïve Bayes, the three re-sampling techniques do not significantly improve the classification accuracy on the test set. For the Decision Tree classifier, random under-sampling and SMOTE significantly improve the accuracy. For the Neural Network, all three re-sampling techniques significantly improve the accuracy, and the Neural Network with SMOTE gives the best accuracy of all classifier/re-sampling combinations. Although under-sampling matches SMOTE in predictive accuracy in this case, a problem associated with it is that we may lose informative instances among the discarded examples.
The competition uses accuracy as the criterion for evaluating the performance of a classifier. This may not be a good choice, since higher accuracy does not necessarily imply better performance on the target task; AUC (the area under the ROC curve) is generally a better performance measure than accuracy.
The second task is a problem of learning from a partially labeled data set: only a subset of the positive examples are labeled, the rest are unlabeled, and there are no labeled negative examples. For this task, I investigated a two-step strategy. In step 1, I used the Spy technique to extract reliable negative examples. In step 2, I used these reliable negative examples together with the labeled positive examples to learn a Naïve Bayes classifier. This two-step strategy gives a significantly better F1 score than simply using all unlabeled examples as negatives. Still, the best score among all teams in the contest is 0.721, far better than mine (0.651), which means I still have a long way to go. There are other alternative two-step strategies: in step 1, various algorithms can identify reliable negative examples, such as 1-DNF and the Rocchio algorithm, and in step 2, SVM turns out to be a better choice for building the final classifier. Thus, future work is to try these different two-step strategies, as well as other non-two-step approaches. Class imbalance and PU-learning remain problems well worth studying.

References

[1] http://mill.ucsd.edu/index.php?page=Datasets

[2] B. Liu, Y. Dai, X. Li, W. S. Lee, and P. S. Yu. Building text classifiers using positive and unlabeled examples. In Proceedings of the 3rd IEEE International Conference on Data Mining (ICDM 2003), pages 179-188, 2003.

[3] B. Liu, W. S. Lee, P. S. Yu, and X. Li. Partially supervised classification of text documents. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML 2002), Sydney, Australia, July 2002.

[4] W. S. Lee and B. Liu. Learning with positive and unlabeled examples using weighted logistic regression. In Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003), Washington, DC, USA, August 2003.

[5] C. Elkan and K. Noto. Learning classifiers from only positive and unlabeled data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008), 2008.

[6] G. H. Nguyen, A. Bouzerdoum, and S. L. Phung. A supervised learning approach for imbalanced data sets. In Proceedings of the 19th International Conference on Pattern Recognition (ICPR 2008), pages 1-4, 2008.

[7] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: special issue on learning from imbalanced data sets. SIGKDD Explorations, 6(1):1-6, 2004.

[8] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.