
An Investigation of Commercial Data Mining
Presented by Emily Davis
Supervisor: John Ebden

Outline
- Data Mining Classification
- Data Mining Algorithms
- Choice of Technique
- Data Mining Process
- Evaluation of Results
- Oracle Data Mining
- Progress

Data Mining Classification
- Directed data mining builds a model that describes one particular variable in terms of the rest of the data. Includes: classification, estimation and prediction.
- Undirected data mining builds a model to establish the relationships amongst all the variables. Includes: affinity grouping or association discovery, clustering, and description or visualization.

Data Mining Algorithms
- Clustering: Groups instances of data into classes and allows for the discovery of structures in the data.
- Neural Networks: Segment the state space of the data with gradients or sloping lines.
- Estimation: Determines the value of an unknown output attribute that is numerical.
- Prediction: Determines future outcomes of data (similar to estimation).
- Classification: Assigns new instances of data to categorical classes.
- Association discovery: Discovers associations between data fields (includes market basket analysis).
- Decision Trees: Use data splitting rules to split the data and then apply more data splitting rules to the resulting subsets of data.
- Association rules: Rule induction to generate patterns relating business goals to other data fields. The patterns are generated as trees with splits on data fields.
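To make the association discovery idea concrete, here is a minimal Java sketch of two standard market basket measures, support and confidence. The slides do not name these measures; the class MarketBasketSketch, its helper methods, and the basket contents are all hypothetical and are not part of ODM or DM4J.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration class; not part of ODM or DM4J.
final class MarketBasketSketch {

    // Convenience helper that builds one basket from a list of item names.
    static Set<String> basket(String... items) {
        return new HashSet<>(Arrays.asList(items));
    }

    // Support: fraction of baskets that contain every item in the item set.
    static double support(List<Set<String>> baskets, Set<String> items) {
        if (baskets.isEmpty()) {
            return 0.0;
        }
        long hits = baskets.stream().filter(b -> b.containsAll(items)).count();
        return (double) hits / baskets.size();
    }

    // Confidence of the rule "antecedent => consequent":
    // support of both item sets together divided by support of the antecedent.
    static double confidence(List<Set<String>> baskets,
                             Set<String> antecedent,
                             Set<String> consequent) {
        Set<String> both = new HashSet<>(antecedent);
        both.addAll(consequent);
        double base = support(baskets, antecedent);
        return base == 0.0 ? 0.0 : support(baskets, both) / base;
    }

    // Made-up transactions, used only for this example.
    static List<Set<String>> sampleBaskets() {
        return Arrays.asList(
                basket("bread", "milk"),
                basket("bread", "butter"),
                basket("bread", "milk", "butter"),
                basket("milk"));
    }

    public static void main(String[] args) {
        List<Set<String>> baskets = sampleBaskets();
        System.out.println("support(bread, milk) = "
                + support(baskets, basket("bread", "milk")));
        System.out.println("confidence(bread => milk) = "
                + confidence(baskets, basket("bread"), basket("milk")));
    }
}
```

With these made-up baskets, bread and milk appear together in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain milk (confidence of about 0.67). A rule miner such as Apriori searches for item sets whose support and confidence exceed chosen thresholds; this sketch only scores one rule that is already given.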
Choosing a Technique
- Supervised learning:
  - set of input and output data
  - clear explanation of results
- Association rules:
  - input and output data have interesting interactions
- Decision trees:
  - faster
  - known which attributes best define the data
- Clustering and neural networks:
  - all attributes are of equal importance
  - perform well on noisy data (neural networks)
- When increased accuracy is required, create multiple models using the same data mining technique until the optimal model is created.

Data Mining Process
- Too much focus on the automatic techniques.
- Not enough focus on the exploration and analysis of the problem and the data.
- Common to all the presented processes:
  - Thorough data preparation and exploration
  - Interpretation and validation of the resulting models

Evaluating the Output
- Evaluation of supervised learning models involves determining the level of predictive accuracy.
- Models are evaluated using test data sets.
- Compare error rates of models created from the same training data to determine accuracy.

Confusion matrix for Model A:

Model A          Model Accept   Model Reject
Actual Accept    600            25
Actual Reject    75             300

- When evaluating numerical output, use error measures instead of the percentage of correct predictions.
- Mean absolute error = average absolute difference between computed and desired outcome.
- Mean squared error = average squared difference between computed and desired outcome.

Cumulative Gains Chart

Evaluating unsupervised learning models using supervised learning
- Perform clustering.
- A cluster is thought of as a class and assigned a name.
- Random samples are chosen from instances of each class.
- A supervised model is then built with the class names as output.
- Random samples are the training set.
- The remaining instances are used to test the accuracy of the clustering model.

Measures of interestingness
These include whether the pattern:
- is easily understood
- is valid with a degree of certainty
- is potentially useful
- is novel
- confirms a hypothesis of some kind
- represents knowledge

Oracle
- Adaptive Bayes Network supporting decision trees (classification)
- Naive Bayes (classification)
- Model Seeker (classification)
- k-Means (clustering)
- O-Cluster (clustering)
- Predictor Variance (attribute importance)
- Apriori (association rules)

ODM

    public class Sample_NaiveBayesBuild_short extends Object {
        public static void main(String[] args) {
            System.out.println("Start: " + new java.util.Date());
            DataMiningServer dms = null;
            oracle.dmt.odm.Connection dmsConnection = null;
            try {
                // Create an instance of the data mining server and get a connection
                // The mining server URL, user_name and password need to be specified
                dms = new DataMiningServer("ora1.ict.ru.ac.za", "system", "emily");
                dmsConnection = dms.login();

                // Create PhysicalDataSpecification object
                // First create a LocationAccessData using the table name and schema name
                LocationAccessData lad = new LocationAccessData("CENSUS_2D_BUILD_UNBINNED", "odm_mtr");

                // Create a NonTransactionalDataSpecification object since the dataset is nontransactional
                PhysicalDataSpecification m_PhysicalDataSpecification = new NonTransactionalDataSpecification(lad);

Data Mining for Java (DM4J)

Progress
- Literature survey
- Oracle installed on Ora in COE
- Exploring the Oracle Suite including JDeveloper
- Member of MetaLink (Oracle’s online support service)

Addressing the Problem
- Run the different algorithms available in the data mining suite on sample data using ODM and DM4J.
- Document and evaluate results using the techniques discussed.
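The evaluation measures from the Evaluating the Output slide can be sketched in a few lines of Java. This is an illustration only, not ODM or DM4J code: the class EvaluationSketch and its method names are hypothetical, the confusion matrix is Model A from the slide, and the numeric arrays are made-up values.

```java
// Hypothetical illustration class; not part of ODM or DM4J.
final class EvaluationSketch {

    // Error rate from a confusion matrix whose rows are actual classes and
    // whose columns are predicted classes: misclassified / total instances.
    static double errorRate(int[][] confusion) {
        long total = 0;
        long correct = 0;
        for (int i = 0; i < confusion.length; i++) {
            for (int j = 0; j < confusion[i].length; j++) {
                total += confusion[i][j];
                if (i == j) {
                    correct += confusion[i][j]; // diagonal = correct predictions
                }
            }
        }
        return total == 0 ? 0.0 : (double) (total - correct) / total;
    }

    // Average absolute difference between computed and desired outcomes.
    static double meanAbsoluteError(double[] computed, double[] desired) {
        checkSameLength(computed, desired);
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            sum += Math.abs(computed[i] - desired[i]);
        }
        return sum / computed.length;
    }

    // Average squared difference between computed and desired outcomes.
    static double meanSquaredError(double[] computed, double[] desired) {
        checkSameLength(computed, desired);
        double sum = 0.0;
        for (int i = 0; i < computed.length; i++) {
            double diff = computed[i] - desired[i];
            sum += diff * diff;
        }
        return sum / computed.length;
    }

    private static void checkSameLength(double[] a, double[] b) {
        if (a.length == 0 || a.length != b.length) {
            throw new IllegalArgumentException("arrays must be non-empty and of equal length");
        }
    }

    public static void main(String[] args) {
        // Model A from the slide: rows = actual accept/reject,
        // columns = model accept/reject.
        int[][] modelA = { {600, 25}, {75, 300} };
        System.out.println("Model A error rate = " + errorRate(modelA));

        // Made-up numeric outputs, used only for this example.
        double[] computed = {2.0, 4.0, 6.0, 8.0};
        double[] desired = {1.0, 4.0, 8.0, 8.0};
        System.out.println("MAE = " + meanAbsoluteError(computed, desired));
        System.out.println("MSE = " + meanSquaredError(computed, desired));
    }
}
```

For Model A, 100 of the 1000 test instances are misclassified (25 + 75), giving an error rate of 0.1. For the made-up numeric values the mean absolute error is 0.75 and the mean squared error is 1.25; squaring penalises the single large miss more heavily than the small one.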
