Assignment 4
Tri-Training: Exploiting Unlabeled Data Using Three Classifiers
Summary
This paper proposes a new algorithm named tri-training to address the problem of exploiting unlabelled
training examples in data mining applications such as web page classification. In data mining it is difficult to
obtain large numbers of labelled examples because labelling requires human input; this algorithm is
semi-supervised, so it uses a small amount of labelled data to label much larger sets of unlabelled examples.
It first generates three classifiers from the labelled training set. The classifiers are then refined using the
unlabelled examples: in each round of tri-training, an unlabelled example is labelled for one classifier if the
other two classifiers agree on its label.
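To make that labelling rule concrete, here is a minimal sketch of tri-training's core loop in Python. The bootstrap initialization, the scikit-learn-style fit/predict interface, and the decision tree as base learner are illustrative assumptions on my part, not details fixed by the paper.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def init_classifiers(X_lab, y_lab, seed=0):
        # Sketch: the three classifiers start from bootstrap samples of
        # the labelled set; the decision tree is an illustrative choice.
        clfs = []
        for s in range(3):
            Xb, yb = resample(X_lab, y_lab, random_state=seed + s)
            clfs.append(DecisionTreeClassifier(random_state=s).fit(Xb, yb))
        return clfs

    def tri_training_round(clfs, X_lab, y_lab, X_unl):
        # For each classifier h_i, an unlabelled example is labelled for
        # it whenever the other two classifiers agree on the label.
        new_sets = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            pred_j = clfs[j].predict(X_unl)
            pred_k = clfs[k].predict(X_unl)
            agree = pred_j == pred_k
            new_sets.append((X_unl[agree], pred_j[agree]))
        # Retrain each classifier on the original labelled data plus the
        # examples its two peers agreed on in this round.
        for i, (X_new, y_new) in enumerate(new_sets):
            clfs[i].fit(np.vstack([X_lab, X_new]),
                        np.concatenate([y_lab, y_new]))
        return clfs

The final hypothesis is then produced by letting the three refined classifiers vote, taking the majority label.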
It builds on the co-training paradigm proposed by Blum and Mitchell, which trains two classifiers separately
on independent sets of attributes and uses the predictions of each classifier on unlabelled examples to augment
the training set of the other. Standard co-training requires the attributes to be partitioned into two sets, each
conditionally independent of the other, so that fewer generalization errors are made. This partitioning
constraint is impractical for most data sets. Other algorithms that try to remove this requirement instead rely
on time-consuming cross-validation, and if the original labelled example set is rather small, the
cross-validation exhibits high variance and is therefore not helpful for model selection. Since tri-training puts
no constraint on the supervised learning algorithm and does not employ a time-consuming cross-validation
process, both its applicability and its efficiency are better.
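For contrast, a minimal sketch of one round of the Blum-Mitchell co-training loop under its two-view assumption might look like the following; the confidence threshold, the predict_proba interface, and the specific view split are illustrative assumptions rather than details from the paper.

    import numpy as np

    def co_training_round(clf_a, clf_b, X1, X2, y, U1, U2, threshold=0.95):
        # Each classifier sees only its own attribute view (X1 or X2) and
        # hands its most confident predictions on the unlabelled pool to
        # the other; the 0.95 threshold is an illustrative assumption.
        clf_a.fit(X1, y)
        clf_b.fit(X2, y)
        proba_a, proba_b = clf_a.predict_proba(U1), clf_b.predict_proba(U2)
        conf_a = proba_a.max(axis=1) >= threshold   # A's confident picks
        conf_b = proba_b.max(axis=1) >= threshold   # B's confident picks
        # A's confident labels grow B's training set, and vice versa.
        X1_aug = np.vstack([X1, U1[conf_b]])
        y1_aug = np.concatenate([y, clf_b.classes_[proba_b[conf_b].argmax(axis=1)]])
        X2_aug = np.vstack([X2, U2[conf_a]])
        y2_aug = np.concatenate([y, clf_a.classes_[proba_a[conf_a].argmax(axis=1)]])
        return (X1_aug, y1_aug), (X2_aug, y2_aug)

Tri-training avoids needing the two views at all: agreement between two of the three classifiers plays the role that per-classifier confidence plays here.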
The benefit of this algorithm is that it neither requires the instance space to be described with sufficient and
redundant views nor puts any constraint on the supervised learning algorithm, so it can be applied to a wider
range of problems than previous co-training-style methods.
The paper tackles the issues of how to label the unlabelled examples and how to produce the final hypothesis.
The method is that an unlabelled example can be labelled for a classifier as long as the other two classifiers
agree on its label, so the confidence of each classifier's labelling does not need to be explicitly measured. The
algorithm can also cope with an increase in the classification noise rate, since the noise can be compensated
for if the amount of newly labelled examples is sufficient.
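The paper grounds this in Angluin and Laird's bound on learning with classification noise; as I read it, a round's newly labelled set is only used when the estimated error falls faster than the labelled set grows. A minimal sketch of that acceptance check, with the error estimate (e.g. measured on the original labelled data) left as an assumption:

    def accept_new_labels(err_prev, n_prev, err_curr, n_curr):
        # Hedged reading of the paper's condition: use this round's newly
        # labelled set only if 0 < err_curr/err_prev < n_prev/n_curr < 1,
        # i.e. the labelled set grows while error * size still shrinks.
        return n_curr > n_prev and 0 < err_curr * n_curr < err_prev * n_prev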
They conducted their experiments on UCI data sets and a web page classification task, and compared the
performance of tri-training with three semi-supervised learning algorithms, i.e. co-training, self-training1,
and self-training2. They showed that tri-training can effectively improve the hypotheses with all the
classifiers under all the unlabel rates. In fact, if the improvements are averaged across all the data sets,
classifiers, and unlabel rates, the average improvement of tri-training is about 11.9%. It is impressive that
with all the classifiers and under all the unlabel rates, tri-training achieved the biggest average improvement.
Counting the number of winning data sets, i.e. the number of data sets on which an algorithm achieved the
biggest improvement among the compared algorithms, tri-training is almost always the winner.
This new algorithm does seem to be more efficient, to generalize better, and to compose a better final
hypothesis from the data than the other algorithms. Moreover, its applicability is wide because it neither
requires sufficient and redundant views nor puts any constraint on the employed supervised learning
algorithm, so it can be applied in a much broader range of settings.
It does have the weakness that it has not overcome the instability of semi-supervised learning algorithms
when unlabelled examples are wrongly labelled during the learning process, though the paper suggests that
data editing mechanisms may be a solution to this problem. Another weakness the authors concede is that
when sufficient and redundant views are available, appropriately utilizing them benefits learning
performance, so other algorithms work better when such views exist.
An improvement could be to increase the number of classifiers, since the algorithm is currently limited to
three and better performance is anticipated with more. Its real promise seems to lie in its use in ensemble
methods, where I can see its benefit.