Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Teaching Basic Modeling Skills Using Real Data Steven Gordon Senior Director of Education and Client Services [email protected] Goals of the Session • Overview of statistical methods that can be used to build a model • Sources of real data to test hypotheses about statistical relationships • Example exercise(s) and techniques for downloading and extracting data, testing statistical relationships, and building a model based on the results Measuring the Strength of a Relationship • Correlation – Statistical relationship between two variables – Goes between -1.0 and 1.0 – Zero means no relationship • http://argyll.epsb.ca/jreed/math9/strand4/scatterPl ot.htm Linear Regression • Assumes a cause and effect between one dependent variable and one or more independent variables • Solution of the linear equation with a best fit to data • Y = aX + b where: Y = the dependent variable a = a coefficient equivalent to the slope of the line b = the Y intercept of the line (the place where it crosses the Y axis) • Y = a1X1 + a2X2 + a3X3 + b for multiple causes •Use the vertical offsets to estimate Y for a given X •Square the differences so all numbers are positive •Sum of the squared offsets or deviations is a measure of the goodness of fit •R2 or coefficient of determination (0 to 1.0) Using Excel for Regression Analysis • Open the file Regress_Data.xls – File shows a dataset with 40 observations showing Income, Education, and Age – What would we hypothesize to be the relationships? Which is the dependent variable? Independent variables? • Need to activate the data analysis toolpak – Click on Tools from main menu in Excel – if Data Analysis is a submenu choice – good to go – Otherwise – Select Tools> Add-Ins – Mark the box next to the Analysis Toolpak and click OK Do a Correlation Analysis • Click on Tools – Data Analysis • Choose Correlation from the pull-down menu • Click in the input range box – Choose all three columns including the labels by clicking and dragging across the entire dataset • Click the Labels box to indicate the first row is labels • Click New Worksheet and give it a name correlate • What are the results? Do a Regression • Click Tools, Data Analysis • Select Regression from the pull down menu • Put your cursor in the field for the input Y Range – Select the range for the dependent variable, including its label • Put the cursor in the Input X Range – Select the other two columns of number • Mark the check boxes – Labels – Confidence Level • Click on the radio button for New Worksheet and give it a name • Click ok Interpreting Regression Outputs in Excel Regression Statistics Multiple R 0.638081096 R Square 0.407147485 Adjusted R Square 0.375944721 Standard Error 7556.02062 Observations 41 37.6 percent of variance explained after adjusting for degrees of freedom More interpretation ANOVA df Regression Residual Total 2 38 40 SS 1489961186 2169551009 3659512195 MS 744980593.1 57093447.61 F 13.04844294 Significance F 4.85274E-05 Significance of the equation. Probability of a false positive is less than 0.004 % Coefficients and Their Significance • Equation: Income = -17954.1 + 440.71*Age+ 1542.41*Education Coefficients Standard Error t Stat Intercept -17954.1 9658.880575 -1.858822842 Age (years) 440.7319 99.06262364 4.449022886 Ed. (yrs) 1542.411 617.4511538 2.498028499 P-value 0.070809952 7.29681E-05 0.016933884 Significance of individual coefficients Fitting Non-Linear Data • Same principle and measurement of the deviations • Choice of curve to fit not automatic – Individual choice with possibility of error – Real relationship may not be fully represented by experimental data Potential Errors • Chose the wrong function for non-linear data • Experiments did not measure all possible circumstances – Behavior may change in areas outside the sample data – E.G. physical limitations of system lead to failure • Variables may be highly correlated and have strong relationship in regression but one is not the cause of the other Exercise Using Real Data