NAME: HESHAM MOUNIR GALAL MOUSTAFA
COURSE: DATA WAREHOUSING AND MINING
REGRESSION
Under the supervision of
PROF. D. AHAMED SHARAF
Regression definition
- Regression uses existing values to forecast what other values will be.
- It is a data mining function for predicting continuous target values for new records, using a model built from records with known target values.
- Regression creates predictive models.
- Regression is a predictive modeling technique where the target variable to be estimated is continuous.
- In a predictive model, when the value of the dependent variable is not known, its value is predicted by the point on the line that corresponds to the values of the independent variables for that record.
Examples
 Let D denote a data set that contains N observations:
 D = {(mₓ, yₓ) | x = 1, 2, …, N}.
 Each mₓ corresponds to the attribute set of the xth observation (also known as the explanatory variables).
 yₓ corresponds to the target variable.
 Regression is the task of learning a target function f that maps each attribute set m into a continuous-valued output y.
 The goal of regression is to find a target function that fits the input data with minimum error.
 The error function for a regression task can be expressed in terms of absolute or squared error:
 Absolute error = Σₓ |yₓ − f(mₓ)|
 Squared error = Σₓ (yₓ − f(mₓ))²
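As a minimal sketch, the two error functions above can be computed directly in Python; the target function f and the data pairs D here are invented for illustration.

```python
# Sketch: the absolute and squared error functions above, for a
# small data set D = {(m_x, y_x)} and a fitted target function f.
# Both f and D are made up for illustration.

def f(m):
    return 2.0 * m + 1.0                      # a learned target function

D = [(1.0, 3.5), (2.0, 4.5), (3.0, 7.0)]      # (m_x, y_x) pairs

absolute_error = sum(abs(y - f(m)) for m, y in D)
squared_error = sum((y - f(m)) ** 2 for m, y in D)
```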
Linear regression
 In linear regression, the model specification is that the dependent variable yᵢ is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling N data points there is one independent variable, xᵢ, and two parameters, β₀ and β₁:
 Straight line: yᵢ = β₀ + β₁xᵢ + εᵢ ; i = 1, …, N.
 In the case of simple regression, the formulas for the least squares estimates are:
 β̂₁ = Σᵢ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ (xᵢ − x̄)² ,  β̂₀ = ȳ − β̂₁x̄.
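These estimates can be sketched in a few lines of Python, here using the height/weight values from the class data set that appears in the SAS program in this document:

```python
# Sketch: least squares estimates b0 and b1 for
# weight = b0 + b1*height, using the height/weight values from
# the class data set used elsewhere in this document.

heights = [69.0, 56.5, 65.3, 62.8, 63.5, 57.3]
weights = [112.5, 84.0, 98.0, 102.5, 102.5, 83.0]

n = len(heights)
xbar = sum(heights) / n
ybar = sum(weights) / n

# b1 = sum((x - xbar)(y - ybar)) / sum((x - xbar)^2); b0 = ybar - b1*xbar
b1 = sum((x - xbar) * (y - ybar) for x, y in zip(heights, weights)) / \
     sum((x - xbar) ** 2 for x in heights)
b0 = ybar - b1 * xbar
```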
Program
If we have the regression equation
Weight = β₀ + β₁·Height + ε
where:
 Weight is the response variable,
 β₀, β₁ are the unknown parameters,
 Height is the regressor variable,
 ε is the unknown error.
The program
data class;
   input name $ height weight age;
   datalines;
Alfred  69.0 112.5 14
Alice   56.5  84.0 13
Barbara 65.3  98.0 13
Carol   62.8 102.5 14
Henry   63.5 102.5 14
James   57.3  83.0 12
;
symbol1 v=dot c=blue height=3.5pct;
proc reg;
   model weight=height;
   plot weight*height / cframe=ligr;
run;
Nonlinear regression
 Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations.
 A nonlinear model is one in which the calculated (predicted) value is a nonlinear function of the parameters.
 For example, it can be written as f(x, β) = β₁x / (β₂ + x).
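As a hedged sketch of "successive approximations", the Python code below fits the parameters of f(x, β) = β₁x / (β₂ + x) by Gauss–Newton iteration; the data points and starting values are invented for illustration.

```python
# Sketch: fitting f(x, b) = b1*x / (b2 + x) by Gauss-Newton
# successive approximations. The data and starting values are
# made up for illustration.

xs = [0.5, 1.0, 2.0, 4.0, 8.0]
ys = [3.0 * x / (2.0 + x) for x in xs]   # noiseless data, true (b1, b2) = (3, 2)

b1, b2 = 1.0, 1.0                        # initial guess
for _ in range(50):
    r = [y - b1 * x / (b2 + x) for x, y in zip(xs, ys)]   # residuals
    j1 = [x / (b2 + x) for x in xs]                       # df/db1
    j2 = [-b1 * x / (b2 + x) ** 2 for x in xs]            # df/db2
    # solve the 2x2 normal equations (J'J) d = J'r for the update d
    a11 = sum(v * v for v in j1)
    a12 = sum(u * v for u, v in zip(j1, j2))
    a22 = sum(v * v for v in j2)
    g1 = sum(u * v for u, v in zip(j1, r))
    g2 = sum(u * v for u, v in zip(j2, r))
    det = a11 * a22 - a12 * a12
    b1 += (a22 * g1 - a12 * g2) / det
    b2 += (a11 * g2 - a12 * g1) / det
```

On this noiseless data the iterates settle back at the true parameter values (3, 2).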
MULTIPLE REGRESSION
 Multiple regression has more than one independent (x) variable; as in linear and nonlinear regression, the dependent (y) variable is a measurement.
 Its goal is to learn more about the relationship between several independent (predictor) variables and a dependent (criterion) variable.
 General model:
Y = a + b1*X1 + b2*X2 + ... + bp*Xp
 For example, a real estate agent might record for each listing the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses, it would be interesting to see whether and how these measures relate to the price for which a house is sold.
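A minimal sketch of fitting the general model above by least squares, using the normal equations; the listing data (size, bedrooms, price) is invented for illustration.

```python
# Sketch: fitting Y = a + b1*X1 + b2*X2 by least squares via the
# normal equations (X'X) beta = X'y. The listing data below is
# made up for illustration.

x1 = [1400.0, 1600.0, 1700.0, 1875.0, 1100.0]  # size in square feet
x2 = [3.0, 3.0, 4.0, 4.0, 2.0]                 # number of bedrooms
y = [245.0, 312.0, 279.0, 308.0, 199.0]        # sale price, in thousands

# design matrix rows [1, x1, x2], then X'X and X'y
rows = [[1.0, u, v] for u, v in zip(x1, x2)]
xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(3)]

# solve the 3x3 system by Gauss-Jordan elimination
for i in range(3):
    piv = xtx[i][i]
    xtx[i] = [v / piv for v in xtx[i]]
    xty[i] /= piv
    for k in range(3):
        if k != i:
            fac = xtx[k][i]
            xtx[k] = [v - fac * w for v, w in zip(xtx[k], xtx[i])]
            xty[k] -= fac * xty[i]

a, b1, b2 = xty   # estimated intercept and coefficients
```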
Logistic regression
 Logistic regression is a model used for predicting the probability of occurrence of an event by fitting data to a logistic curve. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age and sex.
 Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cease a subscription.
 Logistic regression begins with an explanation of the logistic function: f(z) = 1 / (1 + e⁻ᶻ).
 The variable z is usually defined as a linear combination of the predictors: z = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ.
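The heart-attack example above can be sketched in Python; the coefficients and the "age 60, male" case are invented for illustration, not estimates from real data.

```python
import math

def logistic(z):
    # logistic (sigmoid) function: maps any real z into (0, 1)
    return 1.0 / (1.0 + math.exp(-z))

# z is a linear combination of the predictors; the coefficients
# below are made up for illustration
b0, b1, b2 = -5.0, 0.05, 0.8
z = b0 + b1 * 60 + b2 * 1   # a hypothetical 60-year-old male
p = logistic(z)             # predicted probability of the event
```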
Linear regression functions
 These functions support the fitting of an ordinary least squares regression line to a set of number pairs.
 You can use them as aggregate functions or as windowing or reporting functions.
 The functions are as follows:
o REGR_COUNT function
o REGR_AVGY and REGR_AVGX functions
o REGR_SLOPE and REGR_INTERCEPT functions
o REGR_R2 function
o REGR_SXX, REGR_SYY, REGR_SXY functions
REGR_COUNT FUNCTION
 Returns the number of non-null number pairs used to fit the regression line.
REGR_AVGY AND REGR_AVGX FUNCTIONS
 Compute the averages of the dependent variables and the independent variables of the regression line, respectively.
 REGR_AVGY computes the average of its first argument (e1) after null elimination.
 REGR_AVGX computes the average of its second argument (e2) after null elimination.
REGR_SLOPE AND REGR_INTERCEPT FUNCTIONS
 The REGR_SLOPE function computes the slope of the regression line fitted to non-null (e1, e2) pairs.
 The REGR_INTERCEPT function computes the y-intercept of the regression line.
REGR_R2 FUNCTION
 The REGR_R2 function computes the coefficient of determination (usually called R-squared).
 It returns values between 0 and 1.
Sample linear regression calculation
SELECT s.channel_id,
       REGR_SLOPE(s.quantity_sold, p.prod_list_price) slope,
       REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) intcpt,
       REGR_R2(s.quantity_sold, p.prod_list_price) rsqr,
       REGR_COUNT(s.quantity_sold, p.prod_list_price) count,
       REGR_AVGX(s.quantity_sold, p.prod_list_price) avglistp,
       REGR_AVGY(s.quantity_sold, p.prod_list_price) avgqsold
FROM sales s, products p
WHERE s.prod_id = p.prod_id
  AND p.prod_category = 'Electronics'
  AND s.time_id = TO_DATE('10-OCT-2000')
GROUP BY s.channel_id;
Regression vs. classification
 Regression deals with numerical continuous target attributes, whereas classification deals with discrete categorical target attributes.
 If the target attribute contains continuous (floating-point) values, a regression technique is required.
 If the target attribute contains categorical (string or discrete integer) values, a classification technique is called for.
Classification Using Regression
 Division: use a regression function to divide the area into regions.
 Prediction: use a regression function to predict a class membership function. Input includes the desired class.
Height Example Data

Name       Gender  Height  Output1  Output2
Kristina   F       1.6m    Short    Medium
Jim        M       2m      Tall     Medium
Maggie     F       1.9m    Medium   Tall
Martha     F       1.88m   Medium   Tall
Stephanie  F       1.7m    Short    Medium
Bob        M       1.85m   Medium   Medium
Kathy      F       1.6m    Short    Medium
Dave       M       1.7m    Short    Medium
Worth      M       2.2m    Tall     Tall
Steven     M       2.1m    Tall     Tall
Debbie     F       1.8m    Medium   Medium
Todd       M       1.95m   Medium   Medium
Kim        F       1.9m    Medium   Tall
Amy        F       1.8m    Medium   Medium
Wynette    F       1.75m   Medium   Medium

(Output1: classes assigned by the division approach; Output2: classes assigned by the prediction approach.)
Classification and Regression Trees (CART)
 Regression trees are used to predict the continuous value of the target variable using the values of the predictor variables.
 In classification trees, the predictor variables are used to classify objects into categories, i.e. to predict the categorical value of the target variable.
Regression tree
 A regression tree is built through a process known as binary recursive partitioning.
 This is an iterative process of splitting the data into partitions, and then splitting it up further on each of the branches.
 If the target variable is continuous, then a regression tree is generated.
 When using a regression tree to predict the value
of the target variable, the mean value of the target
variable of the rows falling in a terminal (leaf)
node of the tree is the estimated value.
 In this example, the target variable is “Median
value”. From the tree we see that if the value of
the predictor variable “Num. rooms” is greater
than 6.941 the estimated (average) value of the
target variable is 37.238, whereas if the number of
rooms is less than or equal to 6.941 the average
value of the target variable is 19.934.
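The prediction rule described above can be sketched as a one-split tree in Python; the threshold and leaf means are the ones quoted in the text.

```python
# Sketch of the single-split regression tree described above:
# the predicted value is the mean "Median value" of the training
# rows that fall into the same leaf.

def predict_median_value(num_rooms):
    if num_rooms > 6.941:   # split on "Num. rooms"
        return 37.238       # mean target value of rows in this leaf
    return 19.934           # mean target value of rows in the other leaf
```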
Support vector machine algorithm (SVM)
 SVM is a state-of-the-art classification and regression algorithm.
 SVM is an algorithm with strong regularization properties.
 A regression SVM model tries to find a continuous function such that the maximum number of data points lie within an epsilon-wide tube around it.
 SVM performs well in real-world applications such as classifying text and recognizing handwritten characters.
 In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. The figure below presents an overview of the SVM.
References
 Fundamentals of Database Systems, Elmasri & Navathe.
 Support Vector Machines, www.dtreg.com
 Oracle database guide, Paul Lane.
 Introduction to the regression procedure, SAS Institute, USA.
 www.en.wikipedia.org
 www.support-vector.net
 www.statsoft.com