Korea University, Industrial System Information Engineering, 2017-05-22

Contents
• Introduction: Objective, Project Plan
• Data Selection: Property, Preprocessing
• Various Approaches to Classify the Thyroid Disease: C4.5 / C5, SVM, ANN, Ensemble
• Conclusion

Introduction

Objective
• Experience data mining as a part of the KDD process
• Focus on using various data mining techniques
• Our objective is to find a model (classifier), C = f(A)
• Evaluate the constructed models
• Used R GUI version 2.9.0 with Tinn-R version 1.17.2.4

Project Plan
• 4/10: First team meeting
• ~4/26: Find existing research and a data set for the project
• 4/28: Submit an initial proposal
• 5/10: Change the subject of the project
• ~5/27: Try to get a suitable data set for the project
• 5/29: Write a modified proposal
• 6/4: Submit the modified proposal
• 6/6: Decision tree and SVM classifier modeling
• 6/10: Ensemble and ANN model construction
• 6/16: Integrate the results and type the final report
• 6/18: Submit the final report and presentation

Data Selection

Property
• Thyroid Disease data set from the UCI Machine Learning Repository (http://archive.ics.uci.edu/ml/datasets/Thyroid+Disease)
• Attributes: 29 nominal (T/F, M/F, etc.) and ratio attributes
• Nominal attributes have text values
• Some attributes are highly correlated
• Data instances: 2800 training instances, some with missing values; 972 test instances, also with some missing values

Parallel Coordinate Plot (all attributes)
Example code (parallel() comes from the lattice package; current lattice releases name it parallelplot()):

    library(lattice)
    parallel(~hypo.data[, 1:22])

There are too many attributes to analyze the correlation between attributes and classes.

Parallel Coordinate Plot (selected attributes)
Example code:

    attach(hypo.data)
    # One panel per diagnosis, lines colored by diagnosis
    parallel(~hypo.data[, c(1, 2, 17, 18, 19, 20, 21, 22)] | Diagnosis,
             groups = Diagnosis)

According to this plot, the attributes FTI and TT4 may separate primary and compensated hypothyroid.

Preprocessing
• Dimensionality reduction: eliminate highly correlated attributes and select meaningful attributes
• Control of anomalies and missing values: replace them with estimated values
• Attribute transformation: text values to integer values

Dimensionality Reduction (29 attributes to 22)
• For each instance, the attributes TSH, T3, TT4, T4U and FTI are unknown when the corresponding "measured" attribute is FALSE. E.g., if "TSH measured" is FALSE, then the value of TSH is unknown, so "TSH measured" is highly correlated with TSH. We replace the unknowns with zero; each "measured" attribute is then meaningless. DELETE ATTRIBUTES
• The values of "TBG measured" are all FALSE, and the TBG values are all unknown as well. DELETE ATTRIBUTES
• ID is a nominal attribute that serves only to identify each instance uniquely. DELETE ATTRIBUTE

Anomaly
• One age value is 455; it was presumably meant to be 45 or 55
• Replace 455 with 50

Missing Value (age)
• We decided to choose patients who are similar to the patient whose age value is missing
• In the end, we chose 2 patients using Excel and replaced the missing age with the mean of their 2 values

Missing Value (nominal)
• Replaced with all possible values, weighted by their probability distribution (1:2)

Attribute Transformation
• All nominal attributes except SEX have TRUE/FALSE text values: transform them to the integer values 0 (FALSE) and 1 (TRUE)
• The attribute SEX has MALE/FEMALE text values: transform them to the integer values 1 (MALE) and 2 (FEMALE)

Various Approaches to Classify the Thyroid Disease

Decision Tree
• We decided to construct our first classification model with a decision tree
• A decision tree is an easy way to build a classifier
• It is based on Hunt's algorithm
• The impurity of the leaf nodes is measured by entropy

We used the tree library to grow the decision tree in RGui.
Example code:

    library(tree)
    hypo.tree <- tree(Diagnosis ~ ., data = hypo.data)
    # x: instances to classify, y: their actual classes (defined elsewhere)
    pred.tree <- predict(hypo.tree, x, type = "class")
    table(pred.tree, y)   # rows: predicted class, columns: actual class
    plot(hypo.tree, type = "uniform")
    text(hypo.tree, cex = 0.7)

Cross Validation of the Decision Tree (figure)
According to this result, the optimal model, with low deviance, is estimated to have 7 leaf nodes.

Decision Tree (figure)

Decision Tree: Training Set
Accuracy = 2784/2800 = 0.9943
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid            153          3         6           0
Negative                             1       2573         0           2
Primary Hypothyroid                  0          4        58           0
Secondary Hypothyroid                0          0         0           0

The entropy of the original data set is very low (0.4720; max 2).

Decision Tree: Test Set
Accuracy = 968/972 = 0.9959
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid             40          2         1           0
Negative                             0        898         0           0
Primary Hypothyroid                  0          1        30           0
Secondary Hypothyroid                0          0         0           0

Support Vector Machine (SVM)
• An SVM is an efficient model that classifies instances by finding a linear or non-linear hyperplane
• It is a suitable model when the data set has many dimensions
• It is very hard to visualize all data instances with many attributes; with two attributes and some slices, however, we can visualize the instances, including the relationship between the attributes

SVM Modeling Using R (figure)

We thought the attributes FTI and TT4 were suitable for separating the instances. This figure shows how FTI and TT4 separate the instances of the data set, but all of the records in this area are classified as negative.

Now we change the axes and add some slices, which reduces the number of dimensions. The area painted light pink suggests that instances in that area would be predicted as primary hypothyroid.

SVM: Training Set
Accuracy = 2658/2800 = 0.9493
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid             30          0         2           0
Negative                           122       2579        13           2
Primary Hypothyroid                  2          1        49           0
Secondary Hypothyroid                0          0         0           0

The entropy of the original data set is very low (0.4720; max 2).
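The accuracy and entropy figures quoted on the slides can be checked by hand. The following base-R sketch recomputes both from the SVM training confusion matrix above. The helper `entropy` and all variable names are ours, not taken from the project code, and the shortened class labels are only for readability. The entropy comes out at about 0.471 bits, close to (though not exactly) the 0.4720 quoted on the slide.

```r
# SVM training confusion matrix from the slides
# (rows = predicted class, columns = actual class).
classes <- c("compensated", "negative", "primary", "secondary")
svm.train <- matrix(
  c( 30,    0,  2, 0,
    122, 2579, 13, 2,
      2,    1, 49, 0,
      0,    0,  0, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(predicted = classes, actual = classes)
)

# Accuracy = correctly classified instances (diagonal) / all instances.
accuracy <- sum(diag(svm.train)) / sum(svm.train)

# Entropy of a class distribution in bits; empty classes contribute 0.
entropy <- function(counts) {
  p <- counts[counts > 0] / sum(counts)
  -sum(p * log2(p))
}

# Column sums give the actual class counts: 154, 2580, 64, 2.
class.counts <- colSums(svm.train)
H <- entropy(class.counts)

print(round(accuracy, 4))      # 0.9493
print(round(H, 3))             # 0.471
print(log2(length(classes)))   # 2, the maximum entropy for four classes
```

Because 2580 of the 2800 training instances are negative, the entropy sits far below its maximum of 2, which is why a model can reach about 92% accuracy by predicting negative for every instance.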
SVM: Test Set
Accuracy = 933/972 = 0.9599
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid             11          0         4           0
Negative                            29        901         6           0
Primary Hypothyroid                  0          0        21           0
Secondary Hypothyroid                0          0         0           0

Concept of ANN
• An artificial neural network (ANN), usually called a "neural network", is a computational model that tries to simulate the structure and/or functional aspects of biological neural networks
• In most cases an ANN is an adaptive system that changes its structure based on external or internal information that flows through the network during the learning phase
• In a neural network model, simple nodes are connected together to form a network of nodes
• Its practical use comes with algorithms designed to alter the strength (weights) of the connections in the network to produce a desired signal flow

Training / Test Error Rate (figure)
According to this result, the number of hidden nodes to use in the ANN would be 18.

Construction of ANN Classifier
Used the nnet library.
Example code:

    library(nnet)
    y <- hypo.data$Diagnosis
    # 18 hidden nodes, weight decay 5e-4, at most 300 iterations
    hypo.ann <- nnet(Diagnosis ~ ., data = hypo.data, size = 18,
                     decay = 5e-4, maxit = 300)
    hypo.ann
    summary(hypo.ann)
    pred.ann <- predict(hypo.ann, hypo.data, type = "class")
    table(pred.ann, y)   # rows: predicted class, columns: actual class

A 22-18-4 network with 490 weights (figure): (22 + 1) × 18 + (18 + 1) × 4 = 490, counting one bias per hidden and output node.

ANN: Training Set
Accuracy = 2798/2800 = 0.9993
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid            152          0         0           0
Negative                             0       2580         0           0
Primary Hypothyroid                  2          0        64           0
Secondary Hypothyroid                0          0         0           2

This is the highest training accuracy of all the models.

ANN: Test Set
Accuracy = 954/972 = 0.9815
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid             35          7         1           0
Negative                             4        891         2           0
Primary Hypothyroid                  1          3        28           0
Secondary Hypothyroid                0          0         0           0

Bagging: Algorithm
• Also known as bootstrap aggregation: sample with replacement and build a classifier on each bootstrap sample
• Step 1: Draw B bootstrap samples from the sample of size N, then construct a classifier model from each bootstrap sample
• Step 2: Aggregate the B decision trees from step 1
• Step 3: Assign the class chosen by the majority of the values from step 2: $C^{*}(x) = \arg\max_{y} \sum_{i} \delta\left(C_i(x) = y\right)$, where $C_i$ is the classifier built on the i-th bootstrap sample and $\delta(\cdot)$ is 1 if its argument is true and 0 otherwise

Example of Majority Vote (figures: Tree 1 to Tree 5)
According to the majority vote, the class of the 80th instance is predicted as NEGATIVE, which is the same as its actual class.

Result of Bagging (Majority Vote) (figure)
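The sampling and voting steps of bagging can be sketched in base R as follows. This is only an illustration under stated assumptions: the vote matrix is hypothetical (it is not the output of the project's five trees), the helper `majority_vote` is our own name, and fitting a tree on each bootstrap sample (for instance with tree()) is left out so that the sketch runs without extra packages.

```r
# Step 1 (sampling only): draw B bootstrap index sets of size N,
# sampling with replacement from the row numbers 1..N.
set.seed(1)    # arbitrary seed, for reproducibility only
N <- 2800      # training-set size reported on the slides
B <- 5         # five trees, as in the majority-vote slides
boot.idx <- lapply(seq_len(B), function(b) sample(N, N, replace = TRUE))

# Step 3: majority vote. `votes` has one row per instance and one column
# per tree; the most frequent label in each row wins. Ties go to the
# label that sorts first alphabetically, because table() sorts its names.
majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# HYPOTHETICAL votes of five trees for three instances.
votes <- rbind(
  c("negative", "negative", "primary hypothyroid", "negative", "negative"),
  c("compensated hypothyroid", "compensated hypothyroid",
    "compensated hypothyroid", "negative", "negative"),
  c("primary hypothyroid", "negative", "primary hypothyroid",
    "primary hypothyroid", "negative")
)

predicted <- unname(majority_vote(votes))
print(predicted)
```

In a full run, each column of `votes` would come from predict() on a tree fitted to `hypo.data[boot.idx[[b]], ]`.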
Bagging: Training Set
Accuracy = 2790/2800 = 0.9964
Rows: predicted class. Columns: actual class.

Predicted \ Actual         Compensated   Negative   Primary   Secondary
Compensated Hypothyroid            154          4         0           0
Negative                             0       2572         0           2
Primary Hypothyroid                  0          4        64           0
Secondary Hypothyroid                0          0         0           0

Secondary hypothyroid is misclassified again.

Conclusion

• Mining data from raw data sets was a valuable experience for us
• A limitation of our project is that the data set we chose has an insufficient distribution of classes; e.g., only two instances belong to the class secondary hypothyroid
• Because there are not enough instances, the models we constructed may misclassify some classes, especially secondary hypothyroid

• Data mining techniques can be applied to pathology to diagnose disease
• They can also be used for other medical decisions; MRI or CT scans may be good examples
• Because our R programming skills are limited, we could not do everything we wanted
• We therefore referred to research results by J. R. Quinlan when growing the decision trees

• Comparing training error, the ANN model was the best classifier; comparing test error, however, the decision tree classified the instances best
• Most attributes of the data set we used are TRUE/FALSE values. Because decision trees are strong at handling discrete values, the decision tree performed well
• The ensemble model, built from decision trees with the bagging method, was also very accurate because of its majority voting rule
• However, because the number of instances is too small and the initial entropy is too low, it was hard to classify the small class
• By contrast, only the ANN model classified the smallest class correctly, even though that class has only two instances

• Diagnosing serious diseases in pathology is fascinating, but critical. For example, a patient could be diagnosed as normal despite having a very serious disease such as lung cancer
• For this reason, we think a very high cost should be assigned to misclassifying patients as normal/negative, and that not only the error rate of a model but also the costs of its predictions should be considered
• Because there are many considerations in assigning costs, it is hard to estimate them accurately, so we could not apply them to our models
• Even if this classifier can diagnose thyroid disease, the final decision rests with the doctor
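To make the cost argument concrete, the sketch below weights the test-set confusion matrices of the decision tree and the ANN with a cost matrix. The confusion matrices are the ones reported on the slides; the cost values (10 for a diseased patient predicted as negative, 1 for any other error) are purely hypothetical and chosen only for illustration, as are the helper names `conf_matrix` and `total_cost`.

```r
classes <- c("compensated", "negative", "primary", "secondary")

# Build a labelled confusion matrix (rows = predicted, columns = actual).
conf_matrix <- function(values) {
  matrix(values, nrow = 4, byrow = TRUE,
         dimnames = list(predicted = classes, actual = classes))
}

# Test-set confusion matrices as reported on the slides.
tree.test <- conf_matrix(c(40,   2,  1, 0,
                            0, 898,  0, 0,
                            0,   1, 30, 0,
                            0,   0,  0, 0))
ann.test  <- conf_matrix(c(35,   7,  1, 0,
                            4, 891,  2, 0,
                            1,   3, 28, 0,
                            0,   0,  0, 0))

# HYPOTHETICAL cost matrix: correct predictions cost 0, ordinary errors
# cost 1, and predicting "negative" for a diseased patient costs 10.
cost <- matrix(1, nrow = 4, ncol = 4,
               dimnames = list(predicted = classes, actual = classes))
diag(cost) <- 0
cost["negative", c("compensated", "primary", "secondary")] <- 10

# Total cost = sum over all cells of (number of instances x cost per instance).
total_cost <- function(conf, cost) sum(conf * cost)

print(total_cost(tree.test, cost))   # 4
print(total_cost(ann.test, cost))    # 72
```

Under these assumed costs the gap between the two models is much wider than their error rates suggest (4 versus 18 errors), because 6 of the ANN's errors label diseased patients as negative.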