Download 0 - Stevens Institute of Technology

Knowledge Discovery in Databases
MIS 637
Professor Mahmoud Daneshmand
Fall 2012
Final Project: Red Wine Recipe Data Mining
By Jorge Madrazo

Profound Questions
• What basic properties make up the formula for a good wine?
  – Wine making is believed to be an art. But is there a formula for a quality wine?
  – The provider of the data set submitted a paper, “Modeling wine preferences by Data Mining”. How do my results compare with the paper’s?

Procedure
• Follow a data mining process
• Use SAS and SAS Enterprise Miner to execute the process
• The SAS Enterprise Miner tool is modeled on SEMMA, the data mining process defined by the SAS Institute
  – Sample, Explore, Modify, Model, Assess
• SEMMA is similar to the CRISP-DM process

Sample
• 1,599 records
• Set up a data partition
  – Training 40%
  – Validation 30%
  – Test 30%

Explore: Data Background
• Data source
  – UCI Machine Learning Repository
• Wine Quality Data Set
  – There are red and white wine data sets. I focused on the red wine set only.
  – There are 11 input variables and one target variable.
    » fixed acidity
    » volatile acidity
    » citric acid
    » residual sugar
    » chlorides
    » free sulfur dioxide
    » total sulfur dioxide
    » density
    » pH
    » sulphates
    » alcohol
    » Output variable (based on sensory data): quality (score between 0 and 10)

Explore: Target=Quality
• Quality
  – People gave a quality assessment of different wines on a scale of 0-10. The actual range is 3-8.
  – An ordinal target

Explore: Inputs
• Correlation Analysis
  – Some correlation, but not enough to discard inputs

    ods graphics on;
    ods select MatrixPlot;
    proc corr data=wino.red
        plots(maxpoints=100000)=matrix(histogram nvar=all);
      var quality alcohol ph fixed_acidity density volatile_acidity sulphates citric_acid;
    run;

Explore: Correlation Graphs

Explore: Chi2 Statistics of Inputs

Explore: Worth of Inputs

Explore: Worth Graph
• The worth tracks closely with the chi-square statistic

Modify
• At this stage, no modifications are made

Model: Selection
• Because I want to list the important elements of what is considered a quality wine, I chose a decision tree
• Configuration
  – The splitting rule is entropy
  – Maximum branch is set to 5
• Therefore a C4.5-type algorithm is being implemented

Assess: Initial Results
• A bushy tree. The resulting tree is too intricate for a simple recommendation.
  – Over 20 leaf nodes.

Modify: Target
• Change the target so that it becomes binary.
• New variable in the model called isGood. Any rating over 6 is categorized as isGood.
  – SAS Code:

    data wino.xx;
      set wino.red;
      if (quality>6) then isgood=1;
      else isgood = 0;
    run;
    proc print data = wino.xx;
      title 'xx';
    run;

Explore: Target = isGood

Model Strategy for isGood
• Model with a decision tree in the hope of more descriptive results.
• Also model with a neural network to aid in assessment and for comparison

Model: Decision Tree
• ProbF splitting criterion at significance level 0.2
• Maximum branch size = 5

Assess: Decision Tree Results
• Much simpler tree

Assess: Decision Tree Results 2
• Leaf Statistics

Assess: Variable Importance

                      Number of        Number of                                   Ratio of Validation
Variable Name         Splitting Rules  Surrogate Rules  Importance   Validation    to Training
                                                                     Importance    Importance
alcohol               1                0                1            1             1
density               0                1                0.77055175   0.77055175    1
volatile_acidity      0                1                0.728868987  0.728868987   1
sulphates             1                0                0.671675628  0.477710505   0.711222032
fixed_acidity         0                1                0.553719729  0.393817671   0.711222032
citric_acid           0                1                0.549750361  0.390994569   0.711222032
free_sulfur_dioxide   0                0                0            0             NaN
pH                    0                0                0            0             NaN
chlorides             0                0                0            0             NaN
total_sulfur_dioxide  0                0                0            0             NaN
residual_sugar        0                0                0            0             NaN

Event Classification Table

Data Role  Target  False Negative  True Negative  False Positive  True Positive
TRAIN      isgood  53              539            14              34
VALIDATE   isgood  43              403            12              21

Model: Neural Network
• Positive: better at predicting
• Negative: hard to interpret the model
• Configured with 3 hidden nodes

Modify: Input Variables to NN
• Because of the complexity of the NN, it is recommended to prune variables prior to running the network.
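The pruning idea can be sketched in a few lines. The block below is a simplified illustration in plain Python, not the SAS Enterprise Miner variable selection used in the project: it scores each input by its squared Pearson correlation with the target and rejects inputs that fall under a cutoff. The helper names (`r_squared`, `r2_filter`), the cutoff `MIN_R2`, and the toy values are all assumptions made up for the example; they are not taken from the wine data or from the tool's settings.

```python
def r_squared(xs, ys):
    """Squared Pearson correlation between one input and the target."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column explains nothing
    return (sxy * sxy) / (sxx * syy)


def r2_filter(inputs, target, min_r2):
    """Label each input INPUT or REJECTED by comparing its R-square to min_r2."""
    roles = {}
    for name, values in inputs.items():
        keep = r_squared(values, target) >= min_r2
        roles[name] = "INPUT" if keep else "REJECTED"
    return roles


# Toy values invented for illustration only (not rows from the data set).
toy_inputs = {
    "alcohol": [9.4, 9.8, 10.5, 11.2, 12.0, 12.8],
    "pH": [3.3, 3.2, 3.4, 3.3, 3.2, 3.4],
}
toy_isgood = [0, 0, 0, 1, 1, 1]

MIN_R2 = 0.05  # hypothetical cutoff, not the setting used in the project

roles = r2_filter(toy_inputs, toy_isgood, MIN_R2)
for name, role in roles.items():
    print(name, role)
```

In this toy data the alcohol column rises with the target while the pH column does not, so only alcohol survives the cutoff. A real filter would be run on the training partition only, so that the validation and test partitions stay untouched.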
Modify: R2 Filter

Variable Name         Role      Measurement Level  Reason for Rejection
alcohol               INPUT     INTERVAL
chlorides             INPUT     INTERVAL
citric_acid           REJECTED  INTERVAL           Varsel:Small R-square value
density               INPUT     INTERVAL
fixed_acidity         INPUT     INTERVAL
free_sulfur_dioxide   INPUT     INTERVAL
pH                    REJECTED  INTERVAL           Varsel:Small R-square value
residual_sugar        REJECTED  INTERVAL           Varsel:Small R-square value
sulphates             INPUT     INTERVAL
total_sulfur_dioxide  REJECTED  INTERVAL           Varsel:Small R-square value
volatile_acidity      INPUT     INTERVAL

Model: NN
• Specify 3 hidden units in the hidden layer

Assess: NN Results
• Hard to interpret the results to formulate a recipe

The NEURAL Procedure
Optimization Results
Parameter Estimates

N  Parameter                 Estimate    Gradient Objective Function
1  alcohol_H11                3.679818   -0.001411
2  chlorides_H11              0.520190   -0.000479
3  density_H11               -2.171623    0.000883
4  fixed_acidity_H11         -0.055929    0.000179
5  free_sulfur_dioxide_H11    0.403412    0.000139
6  sulphates_H11             -4.954290   -0.000224
7  volatile_acidity_H11       2.686209    0.000205
8  alcohol_H12               -0.313005    0.001209
9  chlorides_H12              0.200973    0.000759

Assess: Comparative Results
• Receiver Operating Characteristic (ROC) chart for NN vs. decision tree

Assess: Comparative Results
• Cumulative lift for NN vs. decision tree

Assess: Comparison with Reference Paper
• Used R-Miner
• Support Vector Machine (SVM) and Neural Network used
• The authors applied techniques to extract the relative importance of variables
• They attempted to predict every quality level
• They noted the importance of alcohol and sulphates. “An increase in sulphates might be related to the fermenting nutrition, which is very important to improve the wine aroma.”

Assess: Paper Variable Importance

Overall Project in SAS EM

References
• UCI Machine Learning Repository http://archive.ics.uci.edu/ml/datasets/Wine
• P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
• Modeling wine preferences by data mining from physicochemical properties, Paulo Cortez et al. http://www3.dsi.uminho.pt/pcortez/wine5.pdf
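Supplementary note: the counts in the decision tree's Event Classification Table can be turned into summary rates. The sketch below is plain Python rather than SAS, and it assumes the table reads as true positive 34, false positive 14, false negative 53, true negative 539 for the training data, and 21, 12, 43, 403 for the validation data. The function name `classification_metrics` is a hypothetical helper introduced for this example.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision and recall from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,   # share of all wines classified correctly
        "precision": tp / (tp + fp),     # share of predicted isGood that are isGood
        "recall": tp / (tp + fn),        # share of actual isGood that were found
    }


# Counts as read from the Event Classification Table (an assumed reading).
train = classification_metrics(tp=34, fp=14, fn=53, tn=539)
validate = classification_metrics(tp=21, fp=12, fn=43, tn=403)

for role, metrics in (("TRAIN", train), ("VALIDATE", validate)):
    summary = ", ".join(f"{name}={value:.3f}" for name, value in metrics.items())
    print(role, summary)
```

Under this reading, the validation recall is 21/64, so the tree finds roughly a third of the wines rated isGood, while accuracy stays high because most wines are not isGood.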
