Data Mining with Oracle using Clustering and Classification Algorithms
Presented by Nhamo Mdzingwa
Supervisor: John Ebden

Overview of Presentation
- Objective of Research
- Background
- Methodology
- Approach
- Implementation
- Results
- Conclusions
- Questions

Objective of Research: Problem statement 1
- Evaluate two types of algorithms available in Oracle10g for data mining (ODM)
- To determine which algorithm builds the most effective model and under what circumstances
- And which model produces the most accurate results when applied to new data

Objective of Research: Problem statement 2
- Gather information from mined dataset
- Find prevention predictors of HIV AIDS
- To do this distinguish clusters
- Or use other mining algorithms to achieve goal

Background: Introduction
- Data mining is a powerful and new technology.
- Steered by the revolutionary progress in digital data acquisition and storage which has resulted in the creation of huge databases

Background: Definition
- It is a process of extracting knowledge from large amounts of data, or simply knowledge discovery in databases
- Is the finding of interesting patterns in data

Methodology: Data mining tool
- Oracle10g database release 1 was installed and configured
- Oracle data miner 10g (ODM) was also installed and configured for use with the database

Methodology: Algorithms in ODM
- Classification: Adaptive Bayes Network, Naive Bayes, Model Seeker
- Association rules: Apriori
- Clustering: k-Means, O-Cluster

Methodology: Clustering Algorithms
- Clustering algorithms support identifying naturally occurring groupings within the data population.
- K-Means: Minimum Error Tolerance and Maximum Iterations; Maximum number of Clusters (k)
- O-Cluster: Sensitivity; Maximum number of Clusters (k)

Methodology: Dataset used
- Obtained from the Centre for AIDS Development, Research and Evaluation, Institute for Social and Economic Research, Rhodes University
- Based on a questionnaire survey
- HIV AIDS related
- Tsha Tsha - HIV AIDS awareness program

Methodology: Dataset used
- 2 data sets put into database tables
- TSHA_TSHA_BUILD1: 500 records, used to build and test models
- TSHA_TSHA_APPLY1: 399 records, used to validate models

Methodology: Determining model accuracy
- Confidence is a measure of the homogeneity of the cluster; that is, how close together the cluster members are
- Support is a measure of the relative size of a cluster (the total need not be 1.00); the higher the value, the larger the cluster

Methodology: Building and Testing the Models
- 20 models built in total
- The building was done in 2 phases:
  1) Distinct number of clusters
  2) Equal number of clusters
- Algorithm settings: based on trial and error

Methodology: 1st phase model building (settings)

Methodology: 1st phase model accuracy

Methodology: 2nd phase model building
- To overcome the problem (bias), I decided to set k, the maximum number of clusters, to a fixed value.
- I set k to 7 for all clusters built in this phase

Methodology: 2nd phase model results

Methodology: Applying the best models
- The most accurate models, BUILD3_OC_TSHATSHA2 from the O-Cluster and BUILD5_KM_TSHATSHA2 from the K-Means, were applied to the new data TSHA_TSHA_APPLY1

Methodology: Determining Cluster Quality
- Adopt and implement the evaluation technique by [Roiger et al, 2003], which involves employing supervised learning to evaluate unsupervised learning.
- Decided to use classification (ABN)
- ODM has classification algorithms
- ABN algorithm has been identified as most accurate in previous research

Methodology: Technique
Supervised Learning for Unsupervised Model Evaluation
1) Designate each formed cluster as a class and assign each class an arbitrary name.
2) Choose a random sample of instances from each class for supervised learning.
3) Build a supervised model from the chosen instances.
4) Employ the remaining instances to test the correctness of the model.

Methodology: Technique
- Build classification model using ABN
- Apply ABN model to remaining instances

Methodology: Comparison of ClusterIDs
CLASSIFICATION TABLE        CLUSTER TABLE
OC_APPLY_ABN           Vs   APPLY_OC3_TSHATSHA (remaining instances, O-Cluster model results)
KM_APPLY_ABN           Vs   APPLY_KM5_TSHATSHA (remaining instances, K-Means model results)

Methodology: Comparison of ClusterIDs
DATA SOURCE              ClusterIDs in BOTH TABLES   PERCENTAGE of ClusterIDs in both models
For O-Cluster results    42 out of 107               39%
For K-Means results      18 out of 107               17%

Determining HIV Predictors: defining predictors
- HIV AIDS predictors of prevention behavior are attributes within our dataset that influence an individual to:
  (A) use a condom when he/she decides to be sexually active
  (B) abstain from sexual intercourse for at least a year
  (C) have fewer sexual partners

Methodology: Determining HIV Predictors
- 2 techniques used to achieve these:
  - Distinguishing the clusters found by the O-Cluster model
  - Employing association rules (Apriori)
- Applied to 2 datasets:
  - Cluster found by O-Cluster model
  - Dataset the O-Cluster model was applied to

Determining HIV Predictors: predictors found
- On distinguishing the clusters found, the attributes HIV test and Know Aids were identified as predictors of condom use and abstinence
- From the associations, the attributes HIV test and talk openly were identified as predictors of condom use.
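
To make the association-rule step above more concrete, here is a minimal sketch of how a single rule such as "HIV test -> condom use" could be scored. It uses the standard association-rule definitions of support and confidence, not ODM's Apriori implementation, and it skips Apriori's candidate generation entirely. The records, attribute names (`hiv_test`, `talk_openly`, `condom_use`) and the helper `rule_stats` are hypothetical and are not taken from the Tsha Tsha dataset.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def rule_stats(records: List[Record],
               antecedent: Record,
               consequent: Record) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = fraction of all records matching antecedent AND consequent
    confidence = fraction of antecedent-matching records that also match
                 the consequent
    """
    def matches(record: Record, items: Record) -> bool:
        return all(record.get(key) == value for key, value in items.items())

    n_total = len(records)
    n_antecedent = sum(1 for r in records if matches(r, antecedent))
    n_both = sum(1 for r in records
                 if matches(r, antecedent) and matches(r, consequent))

    support = n_both / n_total if n_total else 0.0
    confidence = n_both / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Hypothetical toy records, for illustration only.
toy_records = [
    {"hiv_test": "yes", "talk_openly": "yes", "condom_use": "yes"},
    {"hiv_test": "yes", "talk_openly": "no",  "condom_use": "yes"},
    {"hiv_test": "yes", "talk_openly": "yes", "condom_use": "no"},
    {"hiv_test": "no",  "talk_openly": "yes", "condom_use": "yes"},
    {"hiv_test": "no",  "talk_openly": "no",  "condom_use": "no"},
]

support, confidence = rule_stats(
    toy_records, {"hiv_test": "yes"}, {"condom_use": "yes"}
)
print(f"support={support:.2f} confidence={confidence:.2f}")
# prints: support=0.40 confidence=0.67
```

An attribute would be a candidate predictor when rules with it as antecedent show high confidence for the prevention behavior; what counts as "high" is a threshold the analyst has to choose.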
Determining HIV Predictors: The predictors
- HIV test: if one has had an HIV test
- Know Aids: if one knows about AIDS
- Talk openly: if one talks openly about HIV AIDS or not

Conclusions: Regarding the evaluation
- The O-Cluster algorithm produced the most effective model: accuracy 95.5%; when applied to new data, 39%
- Most effective model by K-Means: accuracy of 86.9%; when applied to new data, 17%

Conclusions: Regarding ODM Algorithms
- Classification
  1. Adaptive Bayes Network
  2. Naive Bayes
  3. Model Seeker
- Clustering
  1. k-Means
  2. O-Cluster
- Association rules
  1. Apriori

Conclusions: observations
- Model accuracy gives some indication of how the model will perform on new data
- Therefore it is recommended that one finds the most accurate model for accurate results
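
The cluster-quality evaluation behind these conclusions (treat each cluster as a class, train a classifier on a sample, test on the rest) could be sketched roughly as follows. This is a minimal sketch under stated assumptions: a nearest-centroid classifier stands in for ODM's Adaptive Bayes Network, the points are hypothetical toy data, and every function name is illustrative. The final line only redoes the arithmetic for the reported counts, 42 out of 107 and 18 out of 107.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, ...]


def split_per_class(labels: Sequence[str], train_fraction: float,
                    rng: random.Random) -> Tuple[List[int], List[int]]:
    """Steps 1-2: treat each cluster ID as a class, sample training rows per class."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        cut = max(1, int(len(shuffled) * train_fraction))
        train.extend(shuffled[:cut])
        test.extend(shuffled[cut:])
    return train, test


def fit_centroids(points: Sequence[Point], labels: Sequence[str],
                  train_idx: Sequence[int]) -> Dict[str, Point]:
    """Step 3: build a supervised model (nearest centroid, an assumed stand-in for ABN)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = defaultdict(int)
    for i in train_idx:
        total = sums.setdefault(labels[i], [0.0] * len(points[i]))
        for dim, value in enumerate(points[i]):
            total[dim] += value
        counts[labels[i]] += 1
    return {label: tuple(v / counts[label] for v in total)
            for label, total in sums.items()}


def predict(centroids: Dict[str, Point], point: Point) -> str:
    """Assign the class whose centroid is closest (squared Euclidean distance)."""
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(centroids[c], point)))


def agreement(cluster_ids: Sequence[str],
              predicted_ids: Sequence[str]) -> Tuple[int, float]:
    """Step 4: count instances whose predicted class equals the cluster ID."""
    matches = sum(1 for c, p in zip(cluster_ids, predicted_ids) if c == p)
    pct = 100.0 * matches / len(cluster_ids) if cluster_ids else 0.0
    return matches, pct


# Hypothetical toy data: two well-separated clusters with IDs "A" and "B".
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]
cluster_ids = ["A", "A", "A", "A", "B", "B", "B", "B"]

train_idx, test_idx = split_per_class(cluster_ids, 0.5, random.Random(0))
model = fit_centroids(points, cluster_ids, train_idx)
predicted = [predict(model, points[i]) for i in test_idx]
matches, pct = agreement([cluster_ids[i] for i in test_idx], predicted)
print(matches, pct)  # prints: 4 100.0

# Reported counts from the slides, as rounded percentages.
print(round(100 * 42 / 107), round(100 * 18 / 107))  # prints: 39 17
```

On real data the agreement would depend heavily on the classifier and the sample drawn, so a single split like this one is only a rough indicator of cluster quality.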