CSCI 200 DATA MINING
Introduction to Linear Regression – Predicting Quality of Wine

Predicting Quality of Wine
• Linear regression is a simple and powerful method to analyze data and make predictions.
• Bordeaux is a region in France popular for producing wine.
• There are differences in price and quality from year to year that are sometimes very significant.
• Bordeaux wines are widely believed to taste better when they are older.
• There is an incentive to store young wines until they are mature.

Predicting Quality of Wine
• The main issue: it is hard to determine the quality of the wine when it is so young just by tasting it, since the taste will change significantly by the time it is consumed.
• Wine tasters and experts taste the wine and then predict which ones will be the best later.
• Question: can we model this process and make stronger predictions?

Predicting Quality of Wine
• On March 4, 1990, the New York Times announced that Princeton Professor of Economics Orley Ashenfelter can predict the quality of Bordeaux wine without tasting a single drop.
• Ashenfelter's predictions have nothing to do with assessing the aroma of the wine.
• They are the results of a mathematical model.
• Ashenfelter used a method called linear regression.

Linear Regression
• The method predicts an outcome variable, or dependent variable.
• It uses a set of independent variables.
• Dependent variable: a typical price in 1990-1991 for Bordeaux wine in an auction.
• This approximates quality.
• Independent variables: age of the wine -- so the older wines are more expensive -- and weather-related information.

Linear Regression
• Four independent variables:
• The age of the wine
• The average growing season temperature
• The harvest rain
• The winter rain

Quality of Wine – Linear Regression
• Professor Ashenfelter believed that his predictions are more accurate than those of the world's most influential wine critic, Robert Parker.
• Robert M. Parker Jr., generally regarded as the most influential wine critic in America, calls Professor Ashenfelter's research ''ludicrous and absurd.''

Predicting Quality of Wine – Links
• http://www.wine-economics.org/workingpapers/AAWE_WP04.pdf
• http://www.wine-economics.org/
• http://www.nytimes.com/1990/03/04/us/wine-equation-puts-some-noses-out-of-joint.html

One-Variable Linear Regression
• This method uses one independent variable to predict the dependent variable.
• Independent variable: average growing season temperature (AGST).
• Dependent variable: wine price.
• The goal of linear regression is to create a predictive line through the data.
• There are many different lines that could be drawn to predict wine price using average growing season temperature.

Simple Prediction – Average
• The equation for this line: y = 7.07
• This linear regression model would predict 7.07 regardless of the temperature.

Better Prediction: 0.5*AGST - 1.25
• This linear regression model would predict a higher price when the temperature is higher.

General Equation
• Y = A*X + B – the model
• X – independent variable (in our case AGST)
• Y – dependent variable (in our case Price)
• Using this equation we will calculate PREDICTION values.

Model Makes Errors
• Y = A*X + B + E
• The error term, E, is also often called a residual.
• Y[i] = A*X[i] + B + E[i]
• For each observation, i, we have data for the dependent variable, Y[i], and data for the independent variable, X[i].
• Using this equation we make a prediction.
• This prediction is hopefully close to the true outcome, Y[i].
• Since the coefficients have to be the same for all data points, i, we often make a small error, E[i].
• The best model (choice of A and B) has the smallest error.
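To make the general equation concrete, here is a minimal Python sketch of the model Y = A*X + B and its residuals. The AGST and price values are made-up illustration data, not the actual Bordeaux dataset; A = 0.5 and B = -1.25 are just the coefficients of the "better prediction" line above.

```python
# Minimal sketch of the one-variable model Y = A*X + B.
# The data below are hypothetical illustration values, not the Bordeaux dataset.

agst  = [15.0, 16.2, 17.1, 16.5]   # hypothetical growing-season temperatures (AGST)
price = [6.2,  7.0,  7.5,  7.1]    # hypothetical auction prices

A, B = 0.5, -1.25                  # slope and intercept of the example line

for i, (x, y) in enumerate(zip(agst, price)):
    prediction = A * x + B         # predicted price for observation i
    residual   = y - prediction    # error term E[i] = Y[i] - (A*X[i] + B)
    print(f"obs {i}: predicted {prediction:.2f}, actual {y}, residual {residual:+.2f}")
```

The best choice of A and B is the one that makes these residuals, taken together, as small as possible; the next slides measure that with the sum of squared errors.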
SSE – Sum of Squared Errors
• SSE for the average line (y = 7.07): 10.15064
• SSE for 0.5*AGST - 1.25: 6.03251

Better Measures for Regression Quality
• Root Mean Squared Error (RMSE): RMSE = SQRT(SSE/N), where N is the total number of data points.
• R squared – R2
• R2 compares the best model to a baseline model.
• Baseline model – the model that does not use any variables: the AVERAGE.
• The baseline model predicts the average value of the dependent variable regardless of the value of the independent variable.

R2
• The sum of squared errors for the baseline model is also known as the total sum of squares, commonly referred to as SST.
• In our example: SST = 10.15
• R2 = 1 – SSE/SST
• SSE >= 0, SST >= 0
• SSE <= SST (in Y = A*X + B, if A = 0 we get the baseline model)
• A linear regression model will never be worse than the baseline model.
• R2 = 1 – perfect predictive model
• R2 = 0 – no improvement over the baseline

R2
• R2 is unitless and universally interpretable between problems.
• However, it can still be hard to compare between problems.
• Good models for easy problems will have an R2 close to 1.
• But good models for hard problems can still have an R2 close to zero.

Regression Model Result
• The line that gives the minimum sum of squared errors is the line that the regression model will find.
• Formula for the linear regression model:
• Y = 0.63509*AGST - 3.4178
• R2 = 0.43502
• SSE = 5.73488
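The quality measures above follow directly from their definitions. The Python sketch below, again on made-up data rather than the Bordeaux dataset, fits the least-squares line (the line with minimum SSE) and reports SSE, RMSE = SQRT(SSE/N), and R2 = 1 - SSE/SST; because the data are hypothetical, its output will not match the numbers quoted in the slides.

```python
import math

# Sketch of the regression-quality measures on hypothetical data
# (not the actual Bordeaux dataset, so the numbers will differ from the slides).
agst  = [15.0, 16.2, 17.1, 16.5]     # hypothetical AGST values
price = [6.2,  7.0,  7.5,  7.1]      # hypothetical prices
n = len(price)

def sse(a, b):
    """Sum of squared errors for the line y = a*x + b."""
    return sum((y - (a * x + b)) ** 2 for x, y in zip(agst, price))

# Baseline model: predict the average price regardless of AGST (slope a = 0).
mean_price = sum(price) / n
sst = sse(0.0, mean_price)           # SSE of the baseline = total sum of squares (SST)

# Least-squares fit: the slope and intercept that minimize SSE.
mean_agst = sum(agst) / n
a = sum((x - mean_agst) * (y - mean_price) for x, y in zip(agst, price)) / \
    sum((x - mean_agst) ** 2 for x in agst)
b = mean_price - a * mean_agst

sse_model = sse(a, b)
rmse = math.sqrt(sse_model / n)      # RMSE = sqrt(SSE / N)
r2 = 1 - sse_model / sst             # R2 = 1 - SSE / SST

print(f"model: y = {a:.3f}*AGST + {b:.3f}")
print(f"SSE = {sse_model:.4f}, RMSE = {rmse:.4f}, R2 = {r2:.4f}")
```

Because the baseline is just the fitted line with its slope forced to zero, the fitted model's SSE can never exceed SST, which is why R2 stays between 0 and 1 here.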