Download Final Project presentation (20 min)

Data Mining with Oracle using Clustering and Classification Algorithms
Presented by Nhamo Mdzingwa
Supervisor: John Ebden

Overview of Presentation
  Objective of Research
  Background
  Methodology
    Approach
    Implementation
  Results
  Conclusions
  Questions

Objective of Research: Problem statement 1
  Evaluate two types of algorithms available in Oracle10g for data mining (ODM)
  To determine which algorithm builds the most effective model and under what circumstances
  And which model produces the most accurate results when applied to new data

Objective of Research: Problem statement 2
  Gather information from mined dataset
  Find prevention predictors of HIV/AIDS
  To do this distinguish clusters
  Or use other mining algorithms to achieve goal

Background: Introduction
  Data mining is a powerful and new technology.
  Steered by the revolutionary progress in digital data acquisition and storage, which has resulted in the creation of huge databases

Background: Definition
  It is a process of extracting knowledge from large amounts of data, or simply knowledge discovery in databases
  Is the finding of interesting patterns in data

Methodology: Data mining tool
  Oracle10g database release 1 was installed and configured
  Oracle Data Miner 10g (ODM) was also installed and configured for use with the database

Methodology: Algorithms in ODM
  Classification: Adaptive Bayes Network, Naive Bayes, Model Seeker
  Association rules: Apriori
  Clustering: k-Means, O-Cluster

Methodology: Clustering Algorithms
  Clustering algorithms support identifying naturally occurring groupings within the data population.
  K-Means: Minimum Error Tolerance and Maximum Iterations; Maximum number of Clusters (k)
  O-Cluster: Sensitivity; Maximum number of Clusters (k)

Methodology: Dataset used
  Obtained from the Centre for AIDS Development, Research and Evaluation, Institute for Social and Economic Research, Rhodes University
  Based on a questionnaire survey
    HIV/AIDS related
    Tsha Tsha - HIV/AIDS awareness program

Methodology: Dataset used
  2 data sets put into database tables
  TSHA_TSHA_BUILD1: 500 records, used to build and test models
  TSHA_TSHA_APPLY1: 399 records, used to validate models

Methodology: Determining model accuracy
  Confidence is a measure of the homogeneity of the cluster; that is, how close together the cluster members are
  Support is a measure of the relative size of a cluster (the total need not be 1.00); the higher the value, the larger the cluster

Methodology: Building and Testing the Models
  20 models built in total
  The building was done in 2 phases
    1) Distinct number of clusters
    2) Equal number of clusters
  Algorithm settings: based on trial and error

Methodology: 1st phase model building settings

Methodology: 1st phase model accuracy

Methodology: 2nd phase model building
  To overcome the problem (bias), I decided to set k, the maximum number of clusters, to a fixed value.
  I set k to 7 for all cluster builds in this phase

Methodology: 2nd phase model results

Methodology: Applying the best models
  The most accurate models
    BUILD3_OC_TSHATSHA2 from O-Cluster
    BUILD5_KM_TSHATSHA2 from K-Means
  were applied to the new data, TSHA_TSHA_APPLY1

Methodology: Determining Cluster Quality
  Adopt and implement the evaluation technique of [Roiger et al, 2003]
  It involves employing supervised learning to evaluate unsupervised learning.
  Decided to use classification (ABN)
    ODM has classification algorithms
    The ABN algorithm has been identified as the most accurate in previous research

Methodology: Technique - Supervised Learning for Unsupervised Model Evaluation
  Designate each formed cluster as a class and assign each class an arbitrary name.
  Choose a random sample of instances from each class for supervised learning.
  Build a supervised model from the chosen instances.
  Employ the remaining instances to test the correctness of the model.

Methodology: Technique
  Build classification model using ABN
  Apply ABN model to remaining instances

Methodology: Comparison of ClusterIDs
  CLASSIFICATION TABLE (remaining instances)   vs   CLUSTER TABLE (clustering model results)
  OC_APPLY_ABN                                 vs   APPLY_OC3_TSHATSHA (O-Cluster)
  KM_APPLY_ABN                                 vs   APPLY_KM5_TSHATSHA (K-Means)

Methodology: Comparison of ClusterIDs
  DATA SOURCE          ClusterIDs in BOTH TABLES   PERCENTAGE of ClusterIDs in both models
  O-Cluster results    42 out of 107               39%
  K-Means results      18 out of 107               17%

Determining HIV Predictors: defining predictors
  HIV/AIDS predictors of prevention behavior are attributes within our dataset that influence an individual to:
    (A) use a condom when he/she decides to be sexually active
    (B) abstain from sexual intercourse for a year or more
    (C) have fewer sexual partners

Methodology: Determining HIV Predictors
  2 techniques used to achieve this
    Distinguishing the clusters found by the O-Cluster model
    and employing association rules (Apriori)
  Applied to 2 datasets
    Clusters found by the O-Cluster model
    Dataset the O-Cluster model was applied to

Determining HIV Predictors: predictors found
  On distinguishing the clusters found, the attributes HIV test and Know Aids were identified as predictors of condom use and abstinence
  From the associations, the attributes HIV test and Talk openly were identified as predictors of condom use.
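
Association rules such as those mined by Apriori are usually scored by rule support and rule confidence. The sketch below only computes those two scores for a single rule; it does not run Apriori itself, and it is not the ODM implementation. The four records are invented purely for illustration and are not taken from the Tsha Tsha dataset; the attribute strings and function names are assumptions. These rule-level measures are also distinct from the cluster-level confidence and support defined under "Determining model accuracy".

```python
def support(records, itemset):
    """Fraction of records that contain every item in `itemset`."""
    hits = sum(1 for record in records if itemset <= record)
    return hits / len(records)


def confidence(records, antecedent, consequent):
    """Of the records matching `antecedent`, the fraction also matching `consequent`."""
    base = support(records, antecedent)
    if base == 0:
        return 0.0
    return support(records, antecedent | consequent) / base


# Hypothetical toy records (NOT survey data); each record is a set of attribute=value items.
records = [
    frozenset({"hiv_test=yes", "talk_openly=yes", "condom_use=yes"}),
    frozenset({"hiv_test=yes", "talk_openly=yes", "condom_use=yes"}),
    frozenset({"hiv_test=yes", "talk_openly=no", "condom_use=no"}),
    frozenset({"hiv_test=no", "talk_openly=no", "condom_use=no"}),
]

# Rule under test: {hiv_test=yes, talk_openly=yes} -> {condom_use=yes}
rule_if = frozenset({"hiv_test=yes", "talk_openly=yes"})
rule_then = frozenset({"condom_use=yes"})

rule_support = support(records, rule_if | rule_then)
rule_confidence = confidence(records, rule_if, rule_then)
print(rule_support, rule_confidence)
```

A rule with high confidence but very low support covers few respondents, so both scores are normally read together before an attribute is treated as a predictor.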
Determining HIV Predictors: The predictors
  HIV test – whether one has had an HIV test
  Know Aids – whether one knows about AIDS
  Talk openly – whether one talks openly about HIV/AIDS

Conclusions: Regarding the evaluation
  The O-Cluster algorithm produced the most effective model: accuracy 95.5%
    When applied to new data: 39%
  Most effective model by K-Means: accuracy of 86.9%
    When applied to new data: 17%

Conclusions: Regarding ODM Algorithms
  Classification: 1. Adaptive Bayes Network 2. Naive Bayes 3. Model Seeker
  Clustering: 1. k-Means 2. O-Cluster
  Association rules: 1. Apriori

Conclusions: observations
  Model accuracy gives some indication of how the model will perform on new data
  It is therefore recommended to find the most accurate model in order to obtain accurate results
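
The "when applied to new data" figures above come from the supervised evaluation technique in the Methodology slides: treat each cluster as a class, sample instances per class, build a classifier, and check the remaining instances. The sketch below follows those four steps under stated assumptions. A simple nearest-neighbour rule over categorical attributes stands in for ODM's Adaptive Bayes Network, which is not reproduced here; the toy rows, the cluster names C1 and C2, and all function names are hypothetical. Only the 42/107 and 18/107 counts are taken from the slides.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_ids, fraction, rng):
    """Steps 1-2: treat each cluster ID as a class and sample `fraction` of each."""
    members = defaultdict(list)
    for idx, cid in enumerate(cluster_ids):
        members[cid].append(idx)
    train = []
    for idxs in members.values():
        size = max(1, int(len(idxs) * fraction))
        train.extend(rng.sample(idxs, size))
    chosen = set(train)
    test = [i for i in range(len(cluster_ids)) if i not in chosen]
    return sorted(train), test


def hamming(a, b):
    """Number of attribute positions on which two rows differ."""
    return sum(x != y for x, y in zip(a, b))


def classify(train_rows, train_labels, row):
    """Step 3 stand-in (NOT ABN): label of the nearest training row."""
    nearest = min(range(len(train_rows)), key=lambda i: hamming(train_rows[i], row))
    return train_labels[nearest]


def agreement(predicted, actual):
    """Step 4: share of remaining instances whose predicted class equals the cluster ID."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Hypothetical, well-separated toy rows and cluster assignments (not survey data).
rows = [("yes", "yes")] * 4 + [("no", "no")] * 4
cluster_ids = ["C1"] * 4 + ["C2"] * 4

rng = random.Random(0)
train, test = sample_per_cluster(cluster_ids, 0.5, rng)
train_rows = [rows[i] for i in train]
train_labels = [cluster_ids[i] for i in train]
predicted = [classify(train_rows, train_labels, rows[i]) for i in test]
actual = [cluster_ids[i] for i in test]
held_out_agreement = agreement(predicted, actual)

# Matching ClusterID counts reported in the slides, converted to whole percentages.
reported = {"O-Cluster": (42, 107), "K-Means": (18, 107)}
reported_pct = {name: round(100 * m / n) for name, (m, n) in reported.items()}
print(held_out_agreement, reported_pct)
```

On the toy rows the clusters are trivially separable, so the held-out agreement is perfect; the reported counts reproduce the 39% and 17% quoted in the conclusions.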
