NAME: HESHAM MOUNIR GALAL MOUSTAFA
COURSE: DATA WAREHOUSING AND MINING
REGRESSION
Under supervision of PROF. D. AHAMED SHARAF

Regression definition
- Regression uses existing values to forecast what other values will be.
- A data mining function for predicting continuous target values for new records, using a model built from records with known target values.
- Regression creates predictive models.
- Regression is a predictive modeling technique where the target variable to be estimated is continuous.
- In a predictive model, when the value of the dependent variable is not known, it is predicted by the point on the line that corresponds to the values of the independent variables for that record.

Formal definition
Let D denote a data set that contains N observations:

    D = {(m_x, y_x) | x = 1, 2, ..., N}

Each m_x corresponds to the attribute set of the x-th observation (also known as the explanatory variables), and y_x corresponds to the target variable. Regression is the task of learning a target function f that maps each attribute set m into a continuous-valued output y. The goal of regression is to find a target function that fits the input data with minimum error. The error function for a regression task can be expressed in terms of absolute or squared error:

    Absolute error = Σ_x |y_x - f(m_x)|
    Squared error  = Σ_x (y_x - f(m_x))²

Linear regression
In linear regression, the model specification is that the dependent variable y_i is a linear combination of the parameters (but need not be linear in the independent variables). For example, in simple linear regression for modeling N data points there is one independent variable x_i and two parameters, β0 and β1:

    straight line: y_i = β0 + β1·x_i + ε_i,  i = 1, ..., N

In the case of simple regression, the formulas for the least squares estimates are:

    β1 = Σ_i (x_i - x̄)(y_i - ȳ) / Σ_i (x_i - x̄)²
    β0 = ȳ - β1·x̄

For example, consider the regression equation

    Weight = β0 + β1·Height + ε

where: Weight is the response variable; β0 and β1 are the unknown parameters; Height is the regressor variable; and ε is the unknown error.
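As a concrete sketch, the least squares formulas above can be applied directly in Python; this minimal illustration (not part of the original text) uses the six height/weight rows from the SAS program later in this document:

```python
# Simple linear regression by ordinary least squares (closed form):
#   beta1 = S_xy / S_xx,  beta0 = mean(y) - beta1 * mean(x)
heights = [69.0, 56.5, 65.3, 62.8, 63.5, 57.3]   # x: regressor
weights = [112.5, 84.0, 98.0, 102.5, 102.5, 83.0]  # y: response

n = len(heights)
mean_x = sum(heights) / n
mean_y = sum(weights) / n

# Sums of cross-deviations and squared deviations.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(heights, weights))
sxx = sum((x - mean_x) ** 2 for x in heights)

beta1 = sxy / sxx               # slope
beta0 = mean_y - beta1 * mean_x  # intercept

def predict(height):
    """Point on the fitted line for a new record."""
    return beta0 + beta1 * height

print(beta0, beta1)
```

A property worth noting: with these estimates the residuals always sum to zero, which is one way to sanity-check an implementation.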
The program:

    data class;
       input name $ height weight age;
       datalines;
    Alfred   69.0 112.5 14
    Alice    56.5  84.0 13
    Barbara  65.3  98.0 13
    Carol    62.8 102.5 14
    Henry    63.5 102.5 14
    James    57.3  83.0 12
    ;
    symbol1 v=dot c=blue height=3.5pct;
    proc reg;
       model weight = height;
       plot weight * height / cframe=ligr;
    run;

Nonlinear regression
Nonlinear regression is a form of regression analysis in which observational data are modeled by a function that is a nonlinear combination of the model parameters and depends on one or more independent variables. The data are fitted by a method of successive approximations. A nonlinear model is one in which the calculated value is a nonlinear function of the parameters; for example:

    f(X, β) = β1·X / (β2 + X)

Multiple regression
Multiple regression has more than one independent (x) variable; as in linear and nonlinear regression, the dependent (y) variable is a measurement. Its purpose is to learn more about the relationship between several independent (predictor) variables and a dependent (criterion) variable.

General model:

    Y = a + b1*X1 + b2*X2 + ... + bp*Xp

For example, a real estate agent might record, for each listing, the size of the house (in square feet), the number of bedrooms, the average income in the respective neighborhood according to census data, and a subjective rating of appeal of the house. Once this information has been compiled for various houses, it would be interesting to see whether and how these measures relate to the price for which a house is sold.

Logistic regression
Logistic regression is a model used for predicting the probability of occurrence of an event by fitting data to a logistic curve. It makes use of several predictor variables that may be either numerical or categorical. For example, the probability that a person has a heart attack within a specified time period might be predicted from knowledge of the person's age and sex.
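The general model Y = a + b1*X1 + b2*X2 + ... can be fitted by solving the normal equations (XᵀX)b = Xᵀy. The sketch below uses made-up house listings (size, bedrooms → price) in the spirit of the real estate example; the prices are generated exactly from a known linear formula so the recovered coefficients can be checked:

```python
# Multiple regression: price = a + b1*size + b2*bedrooms.
# Illustrative (made-up) listings; prices follow
# price = 10000 + 120*size + 15000*bedrooms exactly.
rows = [
    (1400, 3, 223000), (1600, 3, 247000), (1700, 4, 274000),
    (1875, 3, 280000), (1100, 2, 172000), (1550, 4, 256000),
    (2350, 4, 352000), (2450, 5, 379000),
]

# Design matrix X (intercept column first) and target vector y.
X = [[1.0, size, beds] for size, beds, _ in rows]
y = [price for _, _, price in rows]

def solve_normal_equations(X, y):
    """Solve (X^T X) b = X^T y by Gauss-Jordan elimination with pivoting."""
    p = len(X[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in X) for j in range(p)]
         + [sum(r[i] * yi for r, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [u - f * v for u, v in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]

a, b1, b2 = solve_normal_equations(X, y)
predicted = a + b1 * 1800 + b2 * 3  # price estimate for a new listing
```

With noiseless data the fitted coefficients reproduce the generating formula; with real listings they would instead be the least squares compromise.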
Logistic regression is used extensively in the medical and social sciences, as well as in marketing applications such as predicting a customer's propensity to purchase a product or cancel a subscription. Logistic regression begins with an explanation of the logistic function:

    f(z) = 1 / (1 + e^(-z))

The variable z is usually defined as a linear combination of the predictor variables:

    z = β0 + β1·x1 + β2·x2 + ... + βk·xk

Linear regression functions
Oracle supports the fitting of an ordinary least squares regression line to a set of number pairs. These functions can be used as aggregate functions or as windowing or reporting functions. The functions are as follows:
o REGR_COUNT
o REGR_AVGY and REGR_AVGX
o REGR_SLOPE and REGR_INTERCEPT
o REGR_R2
o REGR_SXX, REGR_SYY, and REGR_SXY

REGR_COUNT returns the number of non-null number pairs used to fit the regression line.

REGR_AVGY and REGR_AVGX compute the averages of the dependent variable and the independent variable of the regression line, respectively. REGR_AVGY computes the average of its first argument (e1) after null elimination; REGR_AVGX computes the average of its second argument (e2) after null elimination.

REGR_SLOPE computes the slope of the regression line fitted to the non-null (e1, e2) pairs; REGR_INTERCEPT computes the y-intercept of the regression line.

REGR_R2 computes the coefficient of determination (usually called R-squared). It returns values between 0 and 1.

Sample linear regression calculation:

    SELECT s.channel_id,
           REGR_SLOPE(s.quantity_sold, p.prod_list_price) slope,
           REGR_INTERCEPT(s.quantity_sold, p.prod_list_price) intcpt,
           REGR_R2(s.quantity_sold, p.prod_list_price) rsqr,
           REGR_COUNT(s.quantity_sold, p.prod_list_price) count,
           REGR_AVGX(s.quantity_sold, p.prod_list_price) avglistp,
           REGR_AVGY(s.quantity_sold, p.prod_list_price) avgqsold
    FROM sales s, products p
    WHERE s.prod_id = p.prod_id
      AND p.prod_category = 'Electronics'
      AND s.time_id = TO_DATE('10-OCT-2000')
    GROUP BY s.channel_id;
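The semantics of these REGR_ functions can be mirrored in a short Python sketch; the data pairs below are made-up numbers, and None stands in for SQL NULL (e1 is the dependent variable, e2 the independent, as in the text):

```python
def regr_pairs(e1, e2):
    """Keep only pairs where both values are non-null, as the REGR_
    functions do before fitting the line."""
    return [(y, x) for y, x in zip(e1, e2) if y is not None and x is not None]

def regr_count(e1, e2):
    return len(regr_pairs(e1, e2))

def regr_avgy(e1, e2):
    pairs = regr_pairs(e1, e2)
    return sum(y for y, _ in pairs) / len(pairs)

def regr_avgx(e1, e2):
    pairs = regr_pairs(e1, e2)
    return sum(x for _, x in pairs) / len(pairs)

def regr_slope(e1, e2):
    pairs = regr_pairs(e1, e2)
    ax, ay = regr_avgx(e1, e2), regr_avgy(e1, e2)
    sxy = sum((x - ax) * (y - ay) for y, x in pairs)
    sxx = sum((x - ax) ** 2 for _, x in pairs)
    return sxy / sxx

def regr_intercept(e1, e2):
    return regr_avgy(e1, e2) - regr_slope(e1, e2) * regr_avgx(e1, e2)

def regr_r2(e1, e2):
    pairs = regr_pairs(e1, e2)
    b0, b1 = regr_intercept(e1, e2), regr_slope(e1, e2)
    ay = regr_avgy(e1, e2)
    ss_res = sum((y - (b0 + b1 * x)) ** 2 for y, x in pairs)
    ss_tot = sum((y - ay) ** 2 for y, _ in pairs)
    return 1 - ss_res / ss_tot

# quantity_sold vs. prod_list_price; rows with a None are eliminated,
# so only three pairs are used to fit the line.
qty = [10, 14, None, 20, 26]
price = [50, 45, 40, 35, None]
```

Note that the null elimination is pair-wise: a null in either column removes the whole pair, which is why REGR_AVGX here differs from a plain average of the non-null prices.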
Regression vs. classification
Regression deals with numerical (continuous) target attributes, whereas classification deals with discrete (categorical) target attributes. If the target attribute contains continuous (floating-point) values, a regression technique is required; if the target attribute contains categorical (string or discrete integer) values, a classification technique is called for.

Classification using regression
- Division: use the regression function to divide the area into regions.
- Prediction: use the regression function to predict a class membership function. Input includes the desired class.

Height example data

    Name       Gender  Height  Output1  Output2
    Kristina   F       1.6 m   Short    Medium
    Jim        M       2 m     Tall     Medium
    Maggie     F       1.9 m   Medium   Tall
    Martha     F       1.88 m  Medium   Tall
    Stephanie  F       1.7 m   Short    Medium
    Bob        M       1.85 m  Medium   Medium
    Kathy      F       1.6 m   Short    Medium
    Dave       M       1.7 m   Short    Medium
    Worth      M       2.2 m   Tall     Tall
    Steven     M       2.1 m   Tall     Tall
    Debbie     F       1.8 m   Medium   Medium
    Todd       M       1.95 m  Medium   Medium
    Kim        F       1.9 m   Medium   Tall
    Amy        F       1.8 m   Medium   Medium
    Wynette    F       1.75 m  Medium   Medium

(Output1 illustrates division; Output2 illustrates prediction.)

Regression trees and classification trees (CART)
Regression trees are used to predict the continuous value of the target variable from the values of the predictor variables. In classification trees, the predictor variables are used to classify objects into categories, i.e., to predict the categorical value of the target variable.

A regression tree is built through a process known as binary recursive partitioning: an iterative process of splitting the data into partitions and then splitting each branch up further. If the target variable is continuous, a regression tree is generated. When a regression tree is used to predict the value of the target variable, the estimated value is the mean value of the target variable over the rows falling in a terminal (leaf) node of the tree. In this example, the target variable is "Median value". From the tree we see that if the value of the predictor variable "Num. rooms" is greater than 6.941, the estimated (average) value of the target variable is 37.238, whereas if the number of rooms is less than or equal to 6.941, the average value of the target variable is 19.934.

Support vector machine algorithm
SVM is a state-of-the-art classification and regression algorithm with strong regularization properties. A regression SVM model tries to find a continuous function such that the maximum number of data points lie close to it. SVM performs well in real-world applications such as classifying text and recognizing handwritten characters.

In the parlance of the SVM literature, a predictor variable is called an attribute, and a transformed attribute that is used to define the hyperplane is called a feature. The task of choosing the most suitable representation is known as feature selection. A set of features that describes one case (i.e., a row of predictor values) is called a vector. So the goal of SVM modeling is to find the optimal hyperplane that separates clusters of vectors in such a way that cases with one category of the target variable are on one side of the plane and cases with the other category are on the other side of the plane. The vectors near the hyperplane are the support vectors. The figure below presents an overview of the SVM.

References
- Fundamentals of Database Systems, Elmasri & Navathe.
- Support Vector Machines, www.dtreg.com
- Oracle Database Guide, Paul Lane.
- Introduction to Regression Procedures, SAS Institute, USA.
- www.en.wikipedia.org
- www.support-vector.net
- www.statsoft.com
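As a closing sketch, the binary recursive partitioning step described in the regression-tree section can be illustrated in Python. This shows one split level only; the (rooms, value) rows are made-up illustrative data, not the "Median value" data behind the 6.941 split in the text:

```python
# One step of binary recursive partitioning for a regression tree:
# choose the split threshold that minimizes the combined squared error
# of the two sides, then predict with the mean of each leaf.
rows = [(5.0, 18.0), (5.5, 19.5), (6.0, 21.0), (6.5, 22.0),
        (7.0, 35.0), (7.5, 37.0), (8.0, 39.0)]  # made-up (rooms, value)

def mean(vals):
    return sum(vals) / len(vals)

def sse(vals):
    """Squared error of the values around their own mean."""
    m = mean(vals)
    return sum((v - m) ** 2 for v in vals)

def best_split(rows):
    """Try a threshold between each pair of adjacent x values and keep
    the one with the smallest total squared error of the two leaves."""
    xs = sorted({x for x, _ in rows})
    best = None
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in rows if x <= t]
        right = [y for x, y in rows if x > t]
        err = sse(left) + sse(right)
        if best is None or err < best[0]:
            best = (err, t, mean(left), mean(right))
    return best[1:]  # (threshold, left-leaf mean, right-leaf mean)

threshold, left_mean, right_mean = best_split(rows)

def predict(rooms):
    """Leaf-mean prediction, as described for regression trees above."""
    return left_mean if rooms <= threshold else right_mean
```

A full tree would apply best_split recursively to each side until a stopping rule (minimum node size or error reduction) is met.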