Assignment 4
Tri-Training: Exploiting Unlabeled Data Using Three Classifiers
Summary
This paper proposes a new algorithm named tri-training to address the problem of exploiting unlabelled
training examples in data mining applications such as web page classification. In data mining it is difficult to
obtain large numbers of labelled examples because labelling requires human input; this algorithm is
semi-supervised, so it uses a small amount of labelled data to label much larger sets of unlabelled examples.
It first generates three classifiers from the labelled training set. The classifiers are then refined using the
unlabelled examples: in each round of tri-training, an unlabelled example is labelled for one classifier if the
other two classifiers agree on its label.
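To make that labelling rule concrete, here is a minimal sketch of tri-training's core loop in Python. The bootstrap initialization, the scikit-learn-style fit/predict interface, and the decision tree as base learner are illustrative assumptions on my part, not details fixed by the paper.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.utils import resample

    def init_classifiers(X_lab, y_lab, seed=0):
        # Sketch: the three classifiers start from bootstrap samples of
        # the labelled set; the decision tree is an illustrative choice.
        clfs = []
        for s in range(3):
            Xb, yb = resample(X_lab, y_lab, random_state=seed + s)
            clfs.append(DecisionTreeClassifier(random_state=s).fit(Xb, yb))
        return clfs

    def tri_training_round(clfs, X_lab, y_lab, X_unl):
        # For each classifier h_i, an unlabelled example is labelled for
        # it whenever the other two classifiers agree on the label.
        new_sets = []
        for i in range(3):
            j, k = [m for m in range(3) if m != i]
            pred_j = clfs[j].predict(X_unl)
            pred_k = clfs[k].predict(X_unl)
            agree = pred_j == pred_k
            new_sets.append((X_unl[agree], pred_j[agree]))
        # Retrain each classifier on the original labelled data plus the
        # examples its two peers agreed on in this round.
        for i, (X_new, y_new) in enumerate(new_sets):
            clfs[i].fit(np.vstack([X_lab, X_new]),
                        np.concatenate([y_lab, y_new]))
        return clfs

The final hypothesis is then produced by letting the three refined classifiers vote, taking the majority label.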
It builds on the co-training paradigm proposed by Blum and Mitchell, which trains two classifiers separately
on independent sets of attributes and uses the predictions of each classifier on unlabelled examples to augment
the training set of the other. Standard co-training requires the attributes to be partitioned into two sets, each
conditionally independent of the other, so that fewer generalization errors are made. This partitioning
constraint is impractical for most data sets. Other algorithms that try to remove this requirement instead rely
on time-consuming cross-validation, and if the original labelled example set is rather small, the
cross-validation exhibits high variance and is therefore not helpful for model selection. Since tri-training puts
no constraint on the supervised learning algorithm and does not employ a time-consuming cross-validation
process, both its applicability and its efficiency are better.
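For contrast, a minimal sketch of one round of the Blum-Mitchell co-training loop under its two-view assumption might look like the following; the confidence threshold, the predict_proba interface, and the specific view split are illustrative assumptions rather than details from the paper.

    import numpy as np

    def co_training_round(clf_a, clf_b, X1, X2, y, U1, U2, threshold=0.95):
        # Each classifier sees only its own attribute view (X1 or X2) and
        # hands its most confident predictions on the unlabelled pool to
        # the other; the 0.95 threshold is an illustrative assumption.
        clf_a.fit(X1, y)
        clf_b.fit(X2, y)
        proba_a, proba_b = clf_a.predict_proba(U1), clf_b.predict_proba(U2)
        conf_a = proba_a.max(axis=1) >= threshold   # A's confident picks
        conf_b = proba_b.max(axis=1) >= threshold   # B's confident picks
        # A's confident labels grow B's training set, and vice versa.
        X1_aug = np.vstack([X1, U1[conf_b]])
        y1_aug = np.concatenate([y, clf_b.classes_[proba_b[conf_b].argmax(axis=1)]])
        X2_aug = np.vstack([X2, U2[conf_a]])
        y2_aug = np.concatenate([y, clf_a.classes_[proba_a[conf_a].argmax(axis=1)]])
        return (X1_aug, y1_aug), (X2_aug, y2_aug)

Tri-training avoids needing the two views at all: agreement between two of the three classifiers plays the role that per-classifier confidence plays here.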
The benefit of this algorithm is that it neither requires the instance space to be described with sufficient and
redundant views nor puts any constraint on the supervised learning algorithm, so it can be applied to a wider
range of problems than previous co-training-style methods.
The paper tackles the issues of how to label the unlabelled examples and how to produce the final hypothesis.
The method is that an unlabelled example can be labelled for a classifier as long as the other two classifiers
agree on its label, so the confidence of each classifier's labelling does not need to be explicitly measured. The
algorithm can also cope with an increase in the classification noise rate, since the noise can be compensated
for if the amount of newly labelled examples is sufficient.
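The paper grounds this in Angluin and Laird's bound on learning with classification noise; as I read it, a round's newly labelled set is only used when the estimated error falls faster than the labelled set grows. A minimal sketch of that acceptance check, with the error estimate (e.g. measured on the original labelled data) left as an assumption:

    def accept_new_labels(err_prev, n_prev, err_curr, n_curr):
        # Hedged reading of the paper's condition: use this round's newly
        # labelled set only if 0 < err_curr/err_prev < n_prev/n_curr < 1,
        # i.e. the labelled set grows while error * size still shrinks.
        return n_curr > n_prev and 0 < err_curr * n_curr < err_prev * n_prev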
They conducted their experiments on UCI data sets and a web page classification task, and compared the
performance of tri-training with three semi-supervised learning algorithms, i.e. co-training, self-training1,
and self-training2. They showed that tri-training can effectively improve the hypotheses with all the
classifiers under all the unlabel rates. In fact, if the improvements are averaged across all the data sets,
classifiers, and unlabel rates, the average improvement of tri-training is about 11.9%. It is impressive that
with all the classifiers and under all the unlabel rates, tri-training achieved the biggest average improvement.
Counting the number of winning data sets, i.e. the number of data sets on which an algorithm achieved the
biggest improvement among the compared algorithms, tri-training is almost always the winner.
This new algorithm does seem to be more efficient, to generalize better, and to compose a better final
hypothesis from the data than the other algorithms. Moreover, its applicability is wide because it neither
requires sufficient and redundant views nor puts any constraint on the employed supervised learning
algorithm, so it can be applied in a much broader range of settings.
It does have the weakness that it has not overcome the instability of semi-supervised learning algorithms
when unlabelled examples are wrongly labelled during the learning process, though the paper suggests that
data editing mechanisms may be a solution to this problem. Another weakness the authors concede is that
when sufficient and redundant views are available, appropriately utilizing them benefits learning
performance, so other algorithms work better when such views exist.
An improvement could be to increase the number of classifiers, since the algorithm is currently limited to
three and better performance is anticipated with more. Its real promise seems to lie in its use in ensemble
methods, where I can see its benefit.