An Evaluation of a Commercial Data Mining Suite: Oracle Data Mining
Presented by Emily Davis
Supervisor: John Ebden

Oracle Data Mining: An Investigation
Emily Davis
- Investigating the data mining tools and software available with Oracle9i
- Using Oracle Data Mining and JDeveloper (Java API) to run the suite's algorithms (Naïve Bayes and Adaptive Bayes Network) on sample data
- An evaluation of results using confusion matrices, lift charts and error rates
- A comparison of the effectiveness of the different algorithms
- Tools: Oracle Data Mining, DM4J and JDeveloper
Contact: [email protected]
Visit: http://www.cs.ru.ac.za/research/students/g01D1801/

Example confusion matrix (Model A):

                 Model Accept   Model Reject
  Actual Accept      600             25
  Actual Reject       75            300

Problem Statement
- To determine how Oracle provides data mining functionality:
  - Ease of use
  - Data preparation
  - Model building
  - Model testing
  - Applying models to new data

Problem Statement
- To determine whether the algorithms used would find a pattern in a data set
- What happened when the models were applied to a new data set
- To determine which algorithm built the most effective model, and under what circumstances

Problem Statement
- To determine how models are tested, and whether this indicates how they will perform when applied to new data
- To determine how the data affected the model building, and how the test data affected the model testing

Methodology
- Two classification algorithms selected:
  - Naïve Bayes
  - Adaptive Bayes Network
- Both produce predictions, which could then be compared

Methodology
- Data from http://www.ru.ac.za/weather/ (weather data)
- Data recorded includes:
  - Temperature (degrees F)
  - Humidity (percent)
  - Barometer (inches of mercury)
  - Wind Direction (degrees; 360 = North, 90 = East)
  - Wind Speed (MPH)
  - High Wind Speed (MPH)
  - Solar Radiation (Watts/m^2)
  - Rainfall (inches)
  - Wind Chill (computed from high wind speed and temperature)

Data
- The rainfall reading was removed and replaced with a yes or no, depending on whether rainfall was recorded
- This variable, RAIN, was chosen as the target variable
- Two data sets were put into tables in the database:
  - WEATHER_BUILD: 2601 records, used to create build and test data with the Transformation Split wizard
  - WEATHER_APPLY: 290 records, used to validate the models

Building and Testing the Models
- The Priors technique
- Training and tuning the models
- The models built
- Testing
- Results

Data Preparation Techniques: Priors
[Histograms: bin counts of the target RAIN (yes/no) in the build data, shown before and after stratified sampling]

Training and Tuning the Models

                Predicted No   Predicted Yes
  Actual No         384             34
  Actual Yes        141             74

- Viable to introduce a weighting of 3 against false negatives
- This makes a false negative prediction three times as costly as a false positive
- The algorithm then attempts to minimise costs (see the sketch below)

The Models
- 8 models in total, 4 using each algorithm:
  - One using default settings
  - One using the Priors technique
  - One using weighting
  - One using Priors and weighting

Testing the Models
- Tested on the test data set created from the WEATHER_BUILD data set
- Confusion matrices indicate the accuracy of the models
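The cost-based tuning above can be made concrete with a small calculation. The following is a minimal plain-Java sketch (illustrative only; it does not use the ODM/DM4J API, and the class and variable names are mine): it computes the accuracy implied by the confusion matrix shown under Training and Tuning, and shows how a weighting of 3 against false negatives changes the cost figure the algorithm tries to minimise.

```java
// Minimal sketch (plain Java, not the ODM/DM4J API): how a confusion
// matrix yields an accuracy figure, and how a cost matrix weighting
// false negatives 3x changes the score a tuned model minimises.
public class ConfusionMatrixDemo {
    public static void main(String[] args) {
        // Counts from the Training and Tuning slide:
        //              Predicted No   Predicted Yes
        // Actual No        384             34
        // Actual Yes       141             74
        int tn = 384, fp = 34, fn = 141, tp = 74;
        int total = tn + fp + fn + tp;

        double accuracy = (double) (tn + tp) / total;

        // Default cost matrix: every error costs 1.
        double defaultCost = fp + fn;

        // Weighting of 3 against false negatives: a missed "rain"
        // is three times as costly as a false alarm.
        double fnWeight = 3.0;
        double weightedCost = fp + fnWeight * fn;

        System.out.printf("Accuracy:      %.2f%%%n", 100 * accuracy);
        System.out.printf("Default cost:  %.0f%n", defaultCost);
        System.out.printf("Weighted cost: %.0f%n", weightedCost);
    }
}
```

Under the default cost matrix the 141 false negatives and 34 false positives count equally; with the weighting, the false negatives dominate the cost (423 of 457), which is what pushes a cost-minimising model toward predicting "yes" more often.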
Testing Results
[Bar chart: test-set accuracy of the Naïve Bayes and Adaptive Bayes Network models under each combination of settings (no weighting/no priors, no weighting/priors, weighting/no priors, weighting/priors)]

Applying the Models to New Data
- Models were applied to the new data in WEATHER_APPLY

Extracts showing two predictions in the actual results:

  THE_TIME   Prediction   Probability
       1         no          0.9999
     138         yes         0.6711

  THE_TIME   Prediction   Cost of incorrect prediction
       1         no          0
     138         yes         0.3288

(A sketch relating these two extracts appears in the appendix after the Questions slide.)

Attribute Influence on Predictions
- The Adaptive Bayes Network provides rules along with its predictions
- Rules are in if...then format
- The rules showed that the attributes with the most influence were:
  - Wind Chill
  - Wind Direction

Results of Applying Models to New Data
[Bar chart: accuracy of the Naïve Bayes and Adaptive Bayes Network models on WEATHER_APPLY, by model settings]

Comparing Accuracy
[Side-by-side bar charts: accuracy during testing versus accuracy when applied to new data, for both algorithms across all four settings]

Observations
- The algorithms found a pattern in the weather data
- Most effective model: the Adaptive Bayes Network algorithm using weighting
- The accuracy of the Naïve Bayes models improves dramatically if weighting and Priors are used
- There is a significant difference between accuracy during testing and accuracy when the models are applied to new data

Conclusions
- Oracle Data Mining provides easy-to-use wizards that support all aspects of the data mining process
- The algorithms found a pattern in the weather data
- Best case: the Adaptive Bayes Network model predicted 73.1% of RAIN outcomes correctly

Conclusions
- The Adaptive Bayes Network algorithm produced the most effective model: 73.1% accuracy when applied to new data
  - Tuned using a weighting of 3 against false negatives
- The most effective Naïve Bayes model: accuracy of 63.79%
  - Uses a weighting of 3 against false negatives and the Priors technique

Conclusions
- Accuracy during testing does not always indicate the performance of a model on new data
- Test accuracy is inflated if the target attribute distribution in the build and test data sets is similar
- This shows the need to test a model on a variety of data sets

Questions
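Appendix: Reading the Apply Extracts

The two apply-phase extracts appear to be related: with unit error costs, the displayed "cost of incorrect prediction" matches the probability mass assigned to the other class, i.e. 1 − P(predicted class). That reading is an inference from the numbers, not something stated on the slides. The plain-Java sketch below (not the ODM/DM4J API; record and field names are mine) reproduces it.

```java
// Minimal sketch (plain Java, not the ODM API) of reading the apply
// extracts. Assumption (mine, not stated on the slides): the "cost of
// incorrect prediction" column is 1 - P(predicted class).
public class ApplyExtractDemo {
    // Hypothetical shape of one scored row of WEATHER_APPLY.
    record Scored(int theTime, String prediction, double probability) {}

    public static void main(String[] args) {
        Scored[] rows = {
            new Scored(1, "no", 0.9999),
            new Scored(138, "yes", 0.6711),
        };
        for (Scored r : rows) {
            double costIfWrong = 1.0 - r.probability;
            System.out.printf("THE_TIME=%d  prediction=%s  cost if wrong=%.4f%n",
                    r.theTime, r.prediction, costIfWrong);
        }
        // With these numbers: row 1 gives ~0.0001 (shown as 0 on the
        // slide) and row 138 gives ~0.3289 (0.3288 on the slide,
        // presumably a rounding difference in the stored probability).
    }
}
```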