PREDICTION
Elsayed Hemayed
Data Mining Course
Outline
Introduction
Regression Analysis
Linear Regression
Multiple Linear Regression
Predictor Error Measure
Evaluating the Accuracy of a Predictor
Introduction
“What if we would like to predict a continuous value,
rather than a categorical label (like classification)?”
Numeric prediction is the task of predicting
continuous (or ordered) values for a given input, for example:
- The salary of college graduates with 10 years of work
experience.
- The potential sales of a new product given its price.

The most widely used approach for numeric
prediction is regression, a statistical methodology.
Regression Analysis
Regression analysis can be used to model the
relationship between one or more independent or
predictor variables and a dependent or response variable
(which is continuous-valued).
The predictor variables are the attributes of interest
describing the tuple (the values of the predictor
variables are known)
The response variable is what we want to predict
Given a tuple described by predictor variables, we
want to predict the associated value of the response
variable
Regression Analysis – cont.
We’ll discuss straight-line regression analysis (which
involves a single predictor variable) and multiple
linear regression analysis (which involves two or
more predictor variables)
Several software packages exist to solve regression
problems. Examples include SAS (www.sas.com),
SPSS (www.spss.com), and S-Plus
(www.insightful.com).
Linear Regression
Straight Line Regression
Straight line regression
Straight-line regression analysis involves a response
variable, y, and a single predictor variable, x.
It is the simplest form of regression, and models y as
a linear function of x.
That is, y = w0 + w1 x,
where the variance of y is assumed to be constant,
and w0 and w1 are regression coefficients:
- w0 is the Y-intercept
- w1 is the slope of the line.
Straight line regression – cont.
These coefficients can be solved for by the method
of least squares, which estimates the best-fitting
straight line as the one that minimizes the error
between the actual data and the estimate of the
line.
Let D be a training set consisting of values of
predictor variable, x, for some population and their
associated values for response variable, y.
The training set contains |D| data points of the
form (x1, y1), (x2, y2), ..., (x|D|, y|D|).
Straight line regression – cont.
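The regression coefficients can be estimated as

$$w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$$

$$w_0 = \bar{y} - w_1 \bar{x}$$

where $\bar{x}$ is the mean value of x1, x2, ..., x|D|,
and $\bar{y}$ is the mean value of y1, y2, ..., y|D|.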
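As a rough illustration of the least squares estimates for straight-line regression, the sketch below computes the slope as the sum of cross-deviations from the means divided by the sum of squared x-deviations, then derives the intercept from the two means. The function name `fit_line` and the four data points are assumptions made for this example; they are not part of the course material.

```python
from statistics import mean

def fit_line(xs, ys):
    """Return (w0, w1) for the least-squares line y = w0 + w1 * x."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    x_bar, y_bar = mean(xs), mean(ys)
    # Slope: sum of cross-deviations over sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    w1 = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    w0 = y_bar - w1 * x_bar
    return w0, w1

# Illustrative data (not from the slides) lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w0, w1 = fit_line(xs, ys)
print(w0, w1)  # → 1.0 2.0
```

Because the sample points lie exactly on a line, the fit recovers that line; with noisy data the same code returns the line that minimizes the squared error.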
Example – Salary Data
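Using the least squares method, the fitted line is y = 23.6 + 3.5x,
where x is years of experience and y is salary in thousands of dollars.
Thus the predicted salary of a college graduate with, say, 10
years of experience is 23.6 + 3.5(10) = 58.6, that is, $58,600.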
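The arithmetic behind this prediction can be checked with a short sketch. It assumes that y is expressed in thousands of dollars, which is inferred from the $58,600 figure rather than stated on the slide; the function name `predict_salary` is hypothetical.

```python
def predict_salary(years_experience, w0=23.6, w1=3.5):
    """Predicted salary in thousands of dollars (assumed unit) from the fitted line."""
    return w0 + w1 * years_experience

salary_k = predict_salary(10)
# Convert from thousands of dollars to dollars for display.
print(f"${salary_k * 1000:,.0f}")  # → $58,600
```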
Multiple Linear Regression
Multiple linear regression allows the response variable y to be modeled as a
linear function of, say, n predictor variables or
attributes, A1, A2, ..., An, describing a tuple, X.
(That is, X = (x1, x2, ..., xn).)
An example of a multiple linear regression model
based on two predictor attributes or variables, A1
and A2, is y = w0 + w1 x1 + w2 x2,
where x1 and x2 are the values of attributes A1 and
A2, respectively, in X.
Multiple Linear Regression – Least Squares
The least squares method can be extended to solve
for w0, w1, and w2.
The equations, however, become long and are
tedious to solve by hand.
Multiple regression problems are instead commonly
solved with the use of statistical software packages,
such as SAS, SPSS, and S-Plus
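To give a sense of what such packages do internally, here is a minimal sketch that fits y = w0 + w1 x1 + w2 x2 by forming the normal equations and solving them with Gaussian elimination. This is one common way to compute the least squares solution, not necessarily what any particular package uses. The function names and the sample data are assumptions made for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build an augmented matrix so row operations apply to both sides.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system; predictors may be collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution, starting from the last row.
    w = [0.0] * n
    for i in range(n - 1, -1, -1):
        tail = sum(m[i][j] * w[j] for j in range(i + 1, n))
        w[i] = (m[i][n] - tail) / m[i][i]
    return w

def fit_multiple(rows, ys):
    """Return [w0, w1, ..., wn] minimizing the squared error."""
    # Prepend a constant 1 to each tuple so w0 acts as the intercept.
    design = [[1.0] + [float(v) for v in row] for row in rows]
    k = len(design[0])
    # Normal equations: (Z^T Z) w = Z^T y.
    ztz = [[sum(z[i] * z[j] for z in design) for j in range(k)] for i in range(k)]
    zty = [sum(z[i] * y for z, y in zip(design, ys)) for i in range(k)]
    return solve_linear_system(ztz, zty)

# Illustrative data (not from the slides) generated from y = 1 + 2*x1 + 3*x2.
rows = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1)]
ys = [1, 3, 4, 6, 8]
weights = fit_multiple(rows, ys)
print([round(w, 6) for w in weights])  # → [1.0, 2.0, 3.0]
```

For real data sets with many or nearly collinear predictors, a statistical package or numerical library is the more robust choice; the sketch only shows that the method itself is a direct extension of the single-predictor case.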
Predictor Error Measures
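Let DT be a test set of the form (X1, y1), (X2, y2), ...,
(Xd, yd), where the Xi are the n-dimensional test
tuples with associated known values, yi, for a
response variable, y, and d is the number of tuples
in DT.
Writing $y_i'$ for the value predicted for Xi:

$$\text{mean absolute error} = \frac{1}{d} \sum_{i=1}^{d} |y_i - y_i'|$$

$$\text{mean squared error} = \frac{1}{d} \sum_{i=1}^{d} (y_i - y_i')^2$$

The mean squared error exaggerates the presence
of outliers, while the mean absolute error does not.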
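The effect of outliers on the two measures can be seen in a small sketch. The two sets of predictions below are made-up values chosen so that both have the same mean absolute error; only the second concentrates its error in a single tuple.

```python
def mean_absolute_error(actual, predicted):
    """Average of |y_i - y_i'| over the test tuples."""
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / len(actual)

def mean_squared_error(actual, predicted):
    """Average of (y_i - y_i')^2 over the test tuples."""
    return sum((y - p) ** 2 for y, p in zip(actual, predicted)) / len(actual)

# Illustrative values (not from the slides).
actual = [10.0, 12.0, 14.0, 16.0]
evenly_off = [11.0, 11.0, 15.0, 15.0]   # every prediction is off by 1
one_outlier = [10.0, 12.0, 14.0, 20.0]  # a single prediction is off by 4

print(mean_absolute_error(actual, evenly_off), mean_squared_error(actual, evenly_off))
# → 1.0 1.0
print(mean_absolute_error(actual, one_outlier), mean_squared_error(actual, one_outlier))
# → 1.0 4.0
```

Both predictors have the same mean absolute error, but squaring the single large deviation makes the mean squared error four times larger for the second one.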
Predictor Error Measures – cont.
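$$\text{relative absolute error} = \frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$$

$$\text{relative squared error} = \frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$$

where $y_i'$ is the value predicted for Xi and $\bar{y}$ is the mean value of the yi’s of the training data.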
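The mean of the training targets is typically used to normalize a predictor's error against a naive predictor that always returns that mean. The sketch below assumes this slide refers to the relative absolute error and relative squared error defined that way; the function names, the sample values, and the training mean of 13.0 are all hypothetical.

```python
def relative_absolute_error(actual, predicted, train_mean):
    """Total absolute error divided by that of always predicting train_mean."""
    num = sum(abs(y - p) for y, p in zip(actual, predicted))
    den = sum(abs(y - train_mean) for y in actual)
    if den == 0:
        raise ValueError("every actual value equals the training mean")
    return num / den

def relative_squared_error(actual, predicted, train_mean):
    """Total squared error divided by that of always predicting train_mean."""
    num = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    den = sum((y - train_mean) ** 2 for y in actual)
    if den == 0:
        raise ValueError("every actual value equals the training mean")
    return num / den

# Illustrative values (not from the slides).
actual = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 11.0, 15.0, 15.0]
train_mean = 13.0  # hypothetical mean of the training targets

rae = relative_absolute_error(actual, predicted, train_mean)
rse = relative_squared_error(actual, predicted, train_mean)
print(rae, rse)
```

A value below 1 means the predictor does better than the naive mean predictor on the test set; a value above 1 means it does worse.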
Evaluating the Accuracy of a Predictor –
The Holdout Method
The given data are randomly partitioned into two
independent sets, a training set and a test set.
Typically, two-thirds of the data are allocated to the
training set, and the remaining one-third is
allocated to the test set.
The training set is used to derive the model, whose
accuracy is estimated with the test set.
The estimate is pessimistic because only a portion of
the initial data is used to derive the model.
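A minimal sketch of the random partition used by the holdout method, assuming a two-thirds training fraction as described above. The function name `holdout_split`, the fixed seed, and the synthetic data are illustrative choices.

```python
import random

def holdout_split(data, train_fraction=2 / 3, seed=0):
    """Randomly partition data into a training set and a test set."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(data)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Illustrative data (not from the slides): 30 (x, y) pairs.
data = [(x, 2.0 * x + 1.0) for x in range(30)]
train, test = holdout_split(data)
print(len(train), len(test))  # → 20 10
```

The model would then be derived from `train` only, and its error measured on `test`, so that no tuple is used for both.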
[Figure: Estimating accuracy with the holdout method]
Random Subsampling
The holdout method is repeated k times.
The overall accuracy estimate is taken as the average
of the accuracies obtained from each iteration.
(For prediction, we can take the average of the
predictor error rates.)
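Random subsampling can be sketched as a loop around the holdout method. The example below assumes a straight-line predictor fitted by least squares and uses the mean squared error as the error rate. The function names, the synthetic noisy data, and the choice of k = 10 are assumptions made for illustration.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Least-squares estimates (w0, w1) for y = w0 + w1 * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_bar - w1 * x_bar, w1

def holdout_mse(data, rng, train_fraction=2 / 3):
    """One holdout run: fit on a random training part, score on the rest."""
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    w0, w1 = fit_line([x for x, _ in train], [y for _, y in train])
    return mean((y - (w0 + w1 * x)) ** 2 for x, y in test)

def random_subsampling(data, k=10, seed=0):
    """Average the test error over k independent holdout runs."""
    rng = random.Random(seed)
    return mean(holdout_mse(data, rng) for _ in range(k))

# Synthetic data (not from the slides): y = 1 + 2x plus Gaussian noise.
noise = random.Random(42)
data = [(float(x), 1.0 + 2.0 * x + noise.gauss(0, 1)) for x in range(60)]
avg_mse = random_subsampling(data, k=10)
print(f"average test MSE over 10 holdout runs: {avg_mse:.3f}")
```

Averaging over several random partitions makes the estimate less dependent on which tuples happen to fall into a single test set.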
Homework due day 3
Prepare a database with several thousand
records.
Define a data mining application to run on your
data.
Download and install a free data mining tool.
Use the tool to mine your data.
Prepare a demo to present your findings to the
class.
Summary
Introduction
Regression Analysis
Linear Regression
Multiple Linear Regression
Predictor Error Measure
Evaluating the Accuracy of a Predictor