* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download - SC08EdProgK12
Survey
Document related concepts
Transcript
Teaching Basic Modeling Skills Using Real Data Steven Gordon Senior Director of Education and Client Services [email protected] Goals of the Session • Overview of statistical methods that can be used to build a model • Sources of real data to test hypotheses about statistical relationships • Example exercise(s) and techniques for downloading and extracting data, testing statistical relationships, and building a model based on the results Measuring the Strength of a Relationship • Correlation – Statistical relationship between two variables – Goes between -1.0 and 1.0 – Zero means no relationship • http://argyll.epsb.ca/jreed/math9/strand4/scatterPl ot.htm Linear Regression • Assumes a cause and effect between one dependent variable and one or more independent variables • Solution of the linear equation with a best fit to data • Y = aX + b where: Y = the dependent variable a = a coefficient equivalent to the slope of the line b = the Y intercept of the line (the place where it crosses the Y axis) • Y = a1X1 + a2X2 + a3X3 + b for multiple causes •Use the vertical offsets to estimate Y for a given X •Square the differences so all numbers are positive •Sum of the squared offsets or deviations is a measure of the goodness of fit •R2 or coefficient of determination (0 to 1.0) Using Excel for Regression Analysis • Open the file Regress_Data.xls – File shows a dataset with 40 observations showing Income, Education, and Age – What would we hypothesize to be the relationships? Which is the dependent variable? Independent variables? • Need to activate the data analysis toolpak – Click on Tools from main menu in Excel – if Data Analysis is a submenu choice – good to go – Otherwise – Select Tools> Add-Ins – Mark the box next to the Analysis Toolpak and click OK Do a Correlation Analysis • Click on Tools – Data Analysis • Choose Correlation from the pull-down menu • Click in the input range box – Choose all three columns including the labels by clicking and dragging across the entire dataset • Click the Labels box to indicate the first row is labels • Click New Worksheet and give it a name correlate • What are the results? Do a Regression • Click Tools, Data Analysis • Select Regression from the pull down menu • Put your cursor in the field for the input Y Range – Select the range for the dependent variable, including its label • Put the cursor in the Input X Range – Select the other two columns of number • Mark the check boxes – Labels – Confidence Level • Click on the radio button for New Worksheet and give it a name • Click ok Interpreting Regression Outputs in Excel Regression Statistics Multiple R 0.638081096 R Square 0.407147485 Adjusted R Square 0.375944721 Standard Error 7556.02062 Observations 41 37.6 percent of variance explained after adjusting for degrees of freedom More interpretation ANOVA df Regression Residual Total 2 38 40 SS 1489961186 2169551009 3659512195 MS 744980593.1 57093447.61 F 13.04844294 Significance F 4.85274E-05 Significance of the equation. Probability of a false positive is less than 0.004 % Coefficients and Their Significance • Equation: Income = -17954.1 + 440.71*Age+ 1542.41*Education Coefficients Standard Error t Stat Intercept -17954.1 9658.880575 -1.858822842 Age (years) 440.7319 99.06262364 4.449022886 Ed. (yrs) 1542.411 617.4511538 2.498028499 P-value 0.070809952 7.29681E-05 0.016933884 Significance of individual coefficients Fitting Non-Linear Data • Same principle and measurement of the deviations • Choice of curve to fit not automatic – Individual choice with possibility of error – Real relationship may not be fully represented by experimental data Potential Errors • Chose the wrong function for non-linear data • Experiments did not measure all possible circumstances – Behavior may change in areas outside the sample data – E.G. physical limitations of system lead to failure • Variables may be highly correlated and have strong relationship in regression but one is not the cause of the other Exercise Using Real Data