
Regression Analysis

The contents of this chapter are from Chapters 20-23 of the textbook.
The cntry15.sav data will be used. The data contain information on 15 countries:
lifeexpf: female life expectancy
Birthrat: births per 1000 population
Both are scale variables.

Linear regression model

[Figure]

Clearly, the points are not randomly scattered over the grid. Instead, there appears to be a pattern: as birthrate increases, life expectancy decreases.
How do we choose the "best" line? The least squares principle is recommended.

Least squares principle

[Figure]

Dependent variable: the variable you wish to predict.
Independent variable: a variable used to make the prediction.
Simple linear regression: a single numerical independent variable X is used to predict the numerical dependent variable Y:

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

where $\beta_0$ and $\beta_1$ are regression coefficients and $\varepsilon$ is a random error with $E(\varepsilon) = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2$. $\beta_0$, $\beta_1$ and $\sigma^2$ are unknown.

To fit a data set $\{(x_i, y_i),\ i = 1, 2, \ldots, n\}$ to the above model we have

$$Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i$$

where
$Y_i$ = dependent variable (sometimes referred to as the response variable)
$X_i$ = independent variable (sometimes referred to as the explanatory variable)
$\beta_0$ = intercept for the population
$\beta_1$ = slope for the population
$\varepsilon_i$ = random error in $Y_i$ for observation $i$; the $\varepsilon_i$ are iid with mean 0 and variance $\sigma^2$.

The least squares method lets us estimate the regression coefficients $\beta_0$ and $\beta_1$ and the variance of the random error.
The idea of least squares estimation for $\beta_0$ and $\beta_1$ is to minimize

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \bigl(Y_i - (b_0 + b_1 X_i)\bigr)^2, \qquad \hat{Y}_i = b_0 + b_1 X_i$$

where
$\hat{Y}_i$ = predicted value of Y for observation $i$
$X_i$ = value of X for observation $i$
$b_0$ = sample Y intercept
$b_1$ = sample slope

The estimates are computed from

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

$$SS_{xy} = \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)\Bigl(\sum_{i=1}^{n} y_i\Bigr)$$

$$SS_x = \sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - \frac{1}{n}\Bigl(\sum_{i=1}^{n} x_i\Bigr)^2$$

$$b_1 = \frac{SS_{xy}}{SS_x}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$

$$s^2 = \hat{\sigma}^2 = \frac{1}{n-2}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad \hat{y}_i = b_0 + b_1 x_i,\ i = 1, \ldots, n.$$

Linear regression model

[Figure]

The regression model becomes
life expectancy = 90 - (0.70 x birthrate)
That is, an increase of 1 birth per 1000 population is associated with a decrease of about 0.70 years in predicted female life expectancy.

Coefficients (a)
                              Unstandardized Coefficients   Standardized Coefficients
Model 1                       B          Std. Error         Beta                        t          Sig.
(Constant)                    89.985     1.765                                          50.995     .000
Births per 1000 population    -.697      .050               -.968                       -13.988    .000
a. Dependent Variable: Female life expectancy

Prediction and residuals

Case Summaries
Country         Births per 1000   Female life   Unstandardized    Unstandardized
                population        expectancy    predicted value   residual
Algeria         31                68            68.36833          -.36833
Burkina Faso    50                53            55.11929          -2.11929
Cuba            18                79            77.43346          1.56654
Ecuador         28                72            70.46028          1.53972
France          13                82            80.92004          1.07996
Mongolia        34                68            66.27637          1.72363
Namibia         45                63            58.60588          4.39412
Netherlands     13                81            80.92004          .07996
North Korea     24                72            73.24955          -1.24955
Somalia         46                55            57.90856          -2.90856
Tanzania        50                55            55.11929          -.11929
Thailand        20                71            76.03882          -5.03882
Turkey          28                72            70.46028          1.53972
Zaire           45                56            58.60588          -2.60588
Zambia          48                59            56.51393          2.48607

Coefficient of Correlation

It measures the strength of the linear relationship between two numerical variables.
$$r = \frac{\mathrm{cov}(X, Y)}{S_X S_Y}$$

where

$$\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n-1}, \qquad S_X = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}}, \qquad S_Y = \sqrt{\frac{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}{n-1}}$$

The coefficient of correlation satisfies $-1 \le r \le 1$.

Prediction and residuals

The coefficient of determination:

$$R^2 = \frac{\text{regression sum of squares}}{\text{total sum of squares}} = \frac{SSR}{SST}$$

ANOVA

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .968 (a)   .938       .933                2.537
a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F         Sig.
Regression   1259.263         1    1259.263      195.653   .000 (a)
Residual     83.671           13   6.436
Total        1342.933         14
a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

Testing hypotheses about the assumptions

Independence: all of the observations are independent.
Variance homogeneity: the variance of the distribution of the dependent variable must be the same for all values of the independent variable.
Normality: for each value of the independent variable, the distribution of the related dependent variable follows a normal distribution.

Testing hypotheses

Coefficients (a)
                              Unstandardized Coefficients   Standardized Coefficients                        95% Confidence Interval for B
Model 1                       B          Std. Error         Beta                        t          Sig.     Lower Bound   Upper Bound
(Constant)                    89.985     1.765                                          50.995     .000     86.173        93.797
Births per 1000 population    -.697      .050               -.968                       -13.988    .000     -.805         -.590
a. Dependent Variable: Female life expectancy

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .968 (a)   .938       .933                2.537
a. Predictors: (Constant), Births per 1000 population
b. Dependent Variable: Female life expectancy

Testing that the slope is zero

In this example, the sample slope is about -0.70 and its standard error is 0.05, so the value of the t statistic is -0.70/0.05 = -14, and the related p-value is less than 0.0005.
We should reject the hypothesis that the slope is zero. There appears to be a linear relationship between 1992 female life expectancy and birthrate.
The 95% confidence interval for the population slope is (-0.805, -0.590).

Prediction

The regression equation obtained can be used to predict life expectancy from birthrate.
For a country with a birthrate of 30 per 1000 population:
Predicted life expectancy = 89.99 - 0.697 x 30 = 69.08 years

Predicting means and individual observations

The plot on the next slide gives the standard error of the predicted mean life expectancy for different values of birthrate.
The vertical line at 32.9 is the average birthrate for all cases.
The farther a birthrate is from the sample mean, the larger the standard error of the predicted mean.

Plot of standard error of predicted mean

[Figure]

The 95% fitting confidence region

[Figure]

Statistical diagnostics

Is the model correct?
Are there any outliers?
Is the variance constant?
Is the error normally distributed?

Residuals provide much useful information on the above four issues.
You cannot judge the relative size of a residual from its value alone, because the value depends on the unit of the dependent variable. Raw residuals are therefore not convenient to use.
Standardized residuals: divide each residual by the estimated standard deviation of the residuals.

If the distribution of the residuals is approximately normal, about 95% of the standardized residuals should be between -2 and 2, and 99% should be between -2.58 and 2.58. This makes it easy to see whether there are any outliers.

When you compute standardized residuals, all of the observed residuals are divided by the same number.
However, the variability of the residual is not constant for all points; it depends on the value of the independent variable.
The studentized residual takes into account the differences in variability from point to point.
We calculate it by dividing the residual by an estimate of the standard deviation of the residual at that point.
The studentized residuals make it easier to see violations of the regression assumptions.

Case Summaries
Country         Births per 1000   Female life   Unstandardized    Unstandardized   Standardized   Studentized
                population        expectancy    predicted value   residual         residual       residual
Algeria         31                68            68.36833          -.36833          -.14518        -.15039
Burkina Faso    50                53            55.11929          -2.11929         -.83536        -.92252
Cuba            18                79            77.43346          1.56654          .61749         .67055
Ecuador         28                72            70.46028          1.53972          .60691         .63132
France          13                82            80.92004          1.07996          .42569         .48171
Mongolia        34                68            66.27637          1.72363          .67940         .70344
Namibia         45                63            58.60588          4.39412          1.73204        1.85005
Netherlands     13                81            80.92004          .07996           .03152         .03566
North Korea     24                72            73.24955          -1.24955         -.49254        -.51832
Somalia         46                55            57.90856          -2.90856         -1.14647       -1.23146
Tanzania        50                55            55.11929          -.11929          -.04702        -.05193
Thailand        20                71            76.03882          -5.03882         -1.98616       -2.13011
Turkey          28                72            70.46028          1.53972          .60691         .63132
Zaire           45                56            58.60588          -2.60588         -1.02716       -1.09715
Zambia          48                59            56.51393          2.48607          .97994         1.06610

Standardized Residuals

Standardized Residual Stem-and-Leaf Plot
Frequency    Stem &  Leaf
  3.00        -1 .   019
  4.00        -0 .   0148
  7.00         0 .   0466669
  1.00         1 .   7
Stem width: 1.00000
Each leaf: 1 case(s)

Checking for normality

[Figure]

If the data are a sample from a normal distribution, you expect the points to fall more or less on a straight line.
The two largest residuals in absolute value (Thailand and Namibia) are stragglers from the line.
The next slide shows a detrended normal plot. If the data are from a normal distribution, the points in the detrended normal plot should fall randomly in a band around 0.
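The fitted line and the residual columns of the case summaries can be recomputed directly from the least squares formulas. The following is a minimal sketch in plain Python, not SPSS's own procedure. It assumes the studentized residual divides each residual by $s\sqrt{1 - h_i}$ with leverage $h_i = 1/n + (x_i - \bar{x})^2 / SS_x$, which appears to agree with the tabulated values. The function name `fit_simple_regression` and the dictionary keys are hypothetical names chosen for illustration.

```python
import math

# Data from the case summaries: births per 1000 population and
# female life expectancy for the 15 countries.
countries = ["Algeria", "Burkina Faso", "Cuba", "Ecuador", "France",
             "Mongolia", "Namibia", "Netherlands", "North Korea", "Somalia",
             "Tanzania", "Thailand", "Turkey", "Zaire", "Zambia"]
births = [31, 50, 18, 28, 13, 34, 45, 13, 24, 46, 50, 20, 28, 45, 48]
life_exp = [68, 53, 79, 72, 82, 68, 63, 81, 72, 55, 55, 71, 72, 56, 59]


def fit_simple_regression(x, y):
    """Least squares fit of y = b0 + b1 * x, plus residual diagnostics."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ss_x = sum((xi - x_bar) ** 2 for xi in x)
    ss_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    b1 = ss_xy / ss_x            # sample slope
    b0 = y_bar - b1 * x_bar      # sample intercept

    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    # Estimated standard deviation of the error, with n - 2 degrees of freedom.
    s = math.sqrt(sum(e * e for e in residuals) / (n - 2))

    # Assumed leverage formula for simple regression (see lead-in).
    leverage = [1 / n + (xi - x_bar) ** 2 / ss_x for xi in x]

    standardized = [e / s for e in residuals]
    studentized = [e / (s * math.sqrt(1 - h))
                   for e, h in zip(residuals, leverage)]
    return {"b0": b0, "b1": b1, "s": s, "residuals": residuals,
            "standardized": standardized, "studentized": studentized}


fit = fit_simple_regression(births, life_exp)
for name, z, t in zip(countries, fit["standardized"], fit["studentized"]):
    print(f"{name:13s} standardized={z: .5f} studentized={t: .5f}")
```

Points far from the mean birthrate have larger leverage, so their studentized residuals differ most from their standardized residuals.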
Checking for normality

[Figure: detrended normal plot]

Testing for normality

Many statistical tests for normality have been proposed; one of them is the Kolmogorov-Smirnov test.

Tests of Normality
                        Kolmogorov-Smirnov (a)        Shapiro-Wilk
                        Statistic   df   Sig.         Statistic   df   Sig.
Standardized Residual   .137        15   .200*        .971        15   .866
*. This is a lower bound of the true significance.
a. Lilliefors Significance Correction

Both significance values are well above 0.05, so neither test gives evidence against normality of the residuals.

Checking for constant variance

Residual plot: a plot of the studentized residuals against the estimated values.
From the residual plot you can see whether there is any pattern.
When the assumptions hold, the residuals appear to be randomly scattered around a horizontal line through 0.

[Figure]

Checking linearity

When the relationship between two variables is not linear, you can sometimes transform the variables to make the relationship linear, for example by taking the logarithm, sine, or exponential.
[Figure: scatterplot of female life expectancy against natural log of phones per 100]

Multiple Regression Models

Considering the country.sav data, you are interested in predicting female life expectancy from:
Urban: percentage of the population living in urban areas
Docs: number of doctors per 10,000 people
Beds: number of hospital beds per 10,000 people
Gdp: per capita gross domestic product in dollars
Radios: radios per 100 people

A linear regression model is

$$\text{Predicted life expectancy} = \text{Constant} + B_1\,\text{urban} + B_2\,\text{docs} + B_3\,\text{beds} + B_4\,\text{gdp} + B_5\,\text{radios}$$

A scatterplot matrix is useful.

Scatterplot matrix

[Figure]

The relationship between female life expectancy and the percentage of the population living in urban areas appears to be more or less linear.
The other four independent variables appear to be related to female life expectancy, but the relationship is not linear.
We take the natural log of the values of these four independent variables.

Correlation matrix

Correlations (Pearson Correlation)
                                            (1)     (2)     (3)     (4)     (5)     (6)
(1) Female life expectancy                  1.000   .730    .880    .836    .693    .697
(2) Natural log hospital beds/10,000        .730    1.000   .711    .741    .616    .576
(3) Natural log of doctors per 10000        .880    .711    1.000   .824    .633    .763
(4) Natural log of GDP                      .836    .741    .824    1.000   .716    .748
(5) Natural log of radios per 100 people    .693    .616    .633    .716    1.000   .579
(6) Percent urban                           .697    .576    .763    .748    .579    1.000
Sig. (1-tailed): .000 for every pair of distinct variables.
N: 116 for every variable.

Regression coefficients

The estimated regression model, with coefficients taken from the table below, is
Y = 40.77 - 0.020 urban + 4.07 lndocs + 1.15 lnbeds + 1.71 lngdp + 1.54 lnradio

Coefficients (a)
                                        Unstandardized Coefficients   Standardized Coefficients
Model 1                                 B         Std. Error          Beta                        t         Sig.
(Constant)                              40.767    3.174                                           12.845    .000
Natural log hospital beds/10,000        1.147     .749                .095                        1.532     .128
Natural log of doctors per 10000        4.069     .563                .569                        7.228     .000
Natural log of GDP                      1.709     .616                .236                        2.776     .006
Natural log of radios per 100 people    1.542     .686                .130                        2.247     .027
Percent urban                           -.020     .029                -.045                       -.686     .494
a. Dependent Variable: Female life expectancy

SPSS output: model summary statistics

Variables Entered/Removed (b)
Model 1
Variables Entered (a): Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP
Variables Removed: .
Method: Enter
a. All requested variables entered.
b. Dependent Variable: Female life expectancy

Model Summary (b)
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate
1       .910 (a)   .827       .819                4.742
a. Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP
b. Dependent Variable: Female life expectancy

SPSS output: ANOVA

The regression is statistically significant: the significance level of the F test is less than 0.0005.
The estimated residual variance is 22.489.

ANOVA (b)
Model 1      Sum of Squares   df    Mean Square   F         Sig.
Regression   11844.633        5     2368.927      105.336   .000 (a)
Residual     2473.807         110   22.489
Total        14318.440        115
a. Predictors: (Constant), Percent urban, Natural log hospital beds/10,000, Natural log of radios per 100 people, Natural log of doctors per 10000, Natural log of GDP
b. Dependent Variable: Female life expectancy
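The entries of the ANOVA and model summary tables are linked by simple arithmetic: each mean square is a sum of squares divided by its degrees of freedom, F is the ratio of the two mean squares, and $R^2 = SSR/SST$. The sketch below recomputes these from the sums of squares reported above. It is an illustration only; the function name `anova_summary` and the dictionary keys are hypothetical, and the adjusted $R^2$ formula $1 - (1 - R^2)(n-1)/(n-p-1)$ is the standard one, assumed here to be what SPSS reports.

```python
import math


def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Derive F, R squared, adjusted R squared and the standard error
    of the estimate from the rows of a regression ANOVA table."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual        # equals n - 1

    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual       # estimated residual variance

    r_square = ss_regression / ss_total
    # Standard adjustment for the number of predictors (assumed formula).
    adj_r_square = 1 - (1 - r_square) * df_total / df_residual

    return {
        "ms_regression": ms_regression,
        "ms_residual": ms_residual,
        "f": ms_regression / ms_residual,
        "r_square": r_square,
        "adj_r_square": adj_r_square,
        "std_error": math.sqrt(ms_residual),
    }


# Sums of squares and degrees of freedom from the multiple regression ANOVA table.
summary = anova_summary(11844.633, 2473.807, 5, 110)
for key, value in summary.items():
    print(f"{key:14s} {value:.3f}")
```

The same function can be applied to the simple regression table (sums of squares 1259.263 and 83.671 with 1 and 13 degrees of freedom) to check that table in the same way.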
