PREDICTION
Elsayed Hemayed
Data Mining Course

Outline
Introduction
Regression Analysis
Linear Regression
Multiple Linear Regression
Predictor Error Measures
Evaluating the Accuracy of a Predictor

Introduction
"What if we would like to predict a continuous value, rather than a categorical label (as in classification)?"
Numeric prediction is the task of predicting continuous (or ordered) values for a given input: for example, the salary of a college graduate with 10 years of work experience, or the potential sales of a new product given its price. The most widely used approach to numeric prediction is regression, a statistical methodology.

Regression Analysis
Regression analysis can be used to model the relationship between one or more independent (predictor) variables and a dependent (response) variable, which is continuous-valued. The predictor variables are the attributes of interest describing the tuple; their values are known. The response variable is what we want to predict: given a tuple described by predictor variables, we want to predict the associated value of the response variable.
We'll discuss straight-line regression analysis, which involves a single predictor variable, and multiple linear regression analysis, which involves two or more predictor variables. Several software packages exist to solve regression problems; examples include SAS (www.sas.com), SPSS (www.spss.com), and S-Plus (www.insightful.com).

Linear Regression: Straight-Line Regression
Straight-line regression analysis involves a response variable, y, and a single predictor variable, x. It is the simplest form of regression and models y as a linear function of x:

y = w0 + w1*x

where the variance of y is assumed to be constant, and w0 and w1 are regression coefficients: w0 is the Y-intercept and w1 is the slope of the line.
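One common way to estimate w0 and w1 from data is the method of least squares, which can be sketched in a few lines of Python. The `fit_line` helper and the sample points below are illustrative assumptions, not the course's salary data:

```python
# Minimal sketch: fitting y = w0 + w1*x by least squares.
# fit_line and the sample points are illustrative, not from the slides.

def fit_line(xs, ys):
    """Return (w0, w1) minimizing the squared error of y = w0 + w1*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x.
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    # Intercept: the fitted line passes through the point of means.
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Example: points lying exactly on y = 2 + 3x.
xs = [1, 2, 3, 4]
ys = [5, 8, 11, 14]
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # -> 2.0 3.0
```

Because the example points lie exactly on a line, the recovered coefficients match it exactly; with noisy data the fit minimizes the sum of squared residuals instead.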
These coefficients can be solved for by the method of least squares, which estimates the best-fitting straight line as the one that minimizes the error between the actual data and the estimate of the line. Let D be a training set consisting of values of the predictor variable, x, for some population and their associated values of the response variable, y. The training set contains |D| data points of the form (x1, y1), (x2, y2), ..., (x|D|, y|D|). The least-squares estimates of the coefficients are

w1 = sum_{i=1..|D|} (xi - x̄)(yi - ȳ) / sum_{i=1..|D|} (xi - x̄)^2
w0 = ȳ - w1*x̄

where x̄ is the mean value of x1, x2, ..., x|D|, and ȳ is the mean value of y1, y2, ..., y|D|.

Example – Salary Data
Using the least-squares method on the salary data, we obtain y = 23.6 + 3.5x. Thus the predicted salary of a college graduate with, say, 10 years of experience is $58,600.

Multiple Linear Regression
Multiple linear regression allows the response variable y to be modeled as a linear function of, say, n predictor variables or attributes, A1, A2, ..., An, describing a tuple X (that is, X = (x1, x2, ..., xn)). An example of a multiple linear regression model based on two predictor attributes, A1 and A2, is

y = w0 + w1*x1 + w2*x2

where x1 and x2 are the values of attributes A1 and A2, respectively, in X.
The least-squares method can be extended to solve for w0, w1, and w2. The equations, however, become long and tedious to solve by hand. Multiple regression problems are instead commonly solved with statistical software packages such as SAS, SPSS, and S-Plus.

Predictor Error Measures
Let DT be a test set of the form (X1, y1), (X2, y2), ..., (Xd, yd), where the Xi are the n-dimensional test tuples with associated known values, yi, of the response variable, y, and d is the number of tuples in DT. Writing yi' for the predicted value of yi, two common measures are

mean absolute error = (1/d) * sum_{i=1..d} |yi - yi'|
mean squared error  = (1/d) * sum_{i=1..d} (yi - yi')^2

The mean squared error exaggerates the presence of outliers, while the mean absolute error does not.
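The two error measures can be sketched directly from their definitions. The actual/predicted values below are illustrative, chosen so that one prediction misses badly:

```python
# Minimal sketch of the two predictor error measures over a test set of
# (actual, predicted) pairs. The values below are illustrative.

def mean_absolute_error(actual, predicted):
    d = len(actual)
    return sum(abs(y - yp) for y, yp in zip(actual, predicted)) / d

def mean_squared_error(actual, predicted):
    d = len(actual)
    return sum((y - yp) ** 2 for y, yp in zip(actual, predicted)) / d

actual    = [10.0, 12.0, 30.0]
predicted = [11.0, 12.0, 20.0]   # one large miss, like an outlier
print(mean_absolute_error(actual, predicted))  # -> 3.666...
print(mean_squared_error(actual, predicted))   # -> 33.666...
```

Note how the single large miss of 10 dominates the squared measure (33.7 vs. 3.7), illustrating why the mean squared error exaggerates outliers.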
These errors can also be expressed relative to the error of a trivial predictor that always outputs ȳ, the mean value of the yi's of the training data:

relative absolute error = sum_{i=1..d} |yi - yi'| / sum_{i=1..d} |yi - ȳ|
relative squared error  = sum_{i=1..d} (yi - yi')^2 / sum_{i=1..d} (yi - ȳ)^2

Evaluating the Accuracy of a Predictor – The Holdout Method
The given data are randomly partitioned into two independent sets, a training set and a test set. Typically, two-thirds of the data are allocated to the training set and the remaining one-third to the test set. The training set is used to derive the model, whose accuracy is estimated with the test set. The estimate is pessimistic because only a portion of the initial data is used to derive the model.

Random Subsampling
The holdout method is repeated k times. The overall accuracy estimate is taken as the average of the accuracies obtained from the k iterations. (For prediction, we can take the average of the predictor error rates.)

Homework (due day 3)
Prepare a database with several thousand records.
Define a data mining application to run on your data.
Download and install a free data mining tool.
Use the tool to mine your data.
Prepare a demo to present your findings to the class.

Summary
Introduction
Regression Analysis
Linear Regression
Multiple Linear Regression
Predictor Error Measures
Evaluating the Accuracy of a Predictor
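As a closing illustration, the random-subsampling estimate described above can be sketched as follows. The straight-line model, the choice of mean absolute error as the per-iteration measure, and the dataset are all assumptions made for the sketch:

```python
# Minimal sketch of random subsampling: repeat the holdout split k times
# and average the per-iteration test error. The model and data below are
# illustrative assumptions.
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = w0 + w1*x; returns (w0, w1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    w1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
         / sum((x - x_bar) ** 2 for x in xs)
    return y_bar - w1 * x_bar, w1

def holdout_mae(data, train_frac=2/3, k=5, seed=0):
    """Average mean absolute test error over k random holdout splits."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(k):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * train_frac)   # two-thirds for training
        train, test = shuffled[:cut], shuffled[cut:]
        w0, w1 = fit_line([x for x, _ in train], [y for _, y in train])
        total += sum(abs(y - (w0 + w1 * x)) for x, y in test) / len(test)
    return total / k

# Illustrative data: a straight line with small alternating noise.
data = [(x, 23.6 + 3.5 * x + (0.5 if x % 2 else -0.5)) for x in range(1, 21)]
print(round(holdout_mae(data), 3))
```

Because the noise is small, the averaged error stays close to the noise level; each of the k iterations sees a different random two-thirds/one-third split, and averaging smooths out the luck of any single partition.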