ANOVA & Regression: Selecting the Correct Statistical Test

Analysis of Variance (ANOVA)
- Is used when you want to compare means for three or more groups.
- Requires a normal distribution (random sample or population).
- Can be used to determine causation.
- Has an independent variable that is nominal and a dependent variable that is interval/ratio.

Other properties of both the t-test and ANOVA
- Both assume equal variances (and, ideally, an equal number of observations in each group).
- Samples for both the t-test and ANOVA should be independent: separate groups should have different members, and memberships should not overlap between groups.
- Calculations are based on degrees of freedom, which you will see on the SPSS print-out. For the t-test, df = n - 1, where n is the number of observations. As with chi-square, degrees of freedom represent the ability of numbers in the data set to vary.
- Degrees of freedom in ANOVA are a bit more complex, because the calculations are based on the differences in means between groups and within each group:
  - df between groups = number of groups - 1.
  - df within groups = the number of observations in each group minus 1, summed across the groups.

For example, suppose we have three groups with scores on the Depression Test:

  AA:                   30, 52, 24, 60, 19, 57, 45
  Individual Treatment: 60, 34, 56, 27, 42, 51
  Group Counseling:     25, 49, 37, 52

Degrees of freedom:
  Between groups = n - 1 = 3 - 1 = 2 (where n = number of groups)
  Within groups  = sum of (n - 1) for each group = (7 - 1) + (6 - 1) + (5 - 1) = 6 + 5 + 4 = 15

Reading the ANOVA print-out

Report: Highest Year of School Completed by Race of Respondent

  Race of Respondent   Mean    N      Std. Deviation
  White                13.06   1262   2.955
  Black                11.89    199   2.677
  Other                12.47     49   4.001
  Total                12.88   1510   2.984

ANOVA: Highest Year of School Completed

                   Sum of Squares   df     Mean Square   F        Sig.
  Between Groups        240.725        2       120.362   13.746   .000
  Within Groups       13195.99      1507         8.756
  Total               13436.72      1509

Testing a hypothesis with ANOVA
- Confidence level: .01
- Alternative hypothesis: Ethnicity is associated with years of education completed.
- Null hypothesis: There is no association between ethnicity and years of education completed.
- F = 13.746, p = .000. Do we confirm or reject the null hypothesis?
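As a check on these calculations outside SPSS, here is a minimal sketch in Python using scipy (not part of the original handout). The scores are the Depression Test values listed above; note that only four Group Counseling scores appear in the transcript, so the within-groups df computed here comes out one short of the 15 in the worked example, which assumes five scores in that group.

```python
from scipy import stats

# Depression Test scores from the worked example (group names as in the handout)
aa = [30, 52, 24, 60, 19, 57, 45]          # AA, n = 7
individual = [60, 34, 56, 27, 42, 51]      # Individual Treatment, n = 6
group_counseling = [25, 49, 37, 52]        # Group Counseling (only four scores listed)

groups = [aa, individual, group_counseling]
df_between = len(groups) - 1                        # number of groups - 1
df_within = sum(len(g) - 1 for g in groups)         # sum of (n - 1) for each group

f_stat, p_value = stats.f_oneway(*groups)           # one-way ANOVA
print(f"df between = {df_between}, df within = {df_within}")
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")

# The Sig. column in the SPSS print-out can also be recovered from F and its df,
# e.g. for the education-by-race table: F = 13.746 with df = (2, 1507)
print(f"p = {stats.f.sf(13.746, 2, 1507):.6f}")
```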
Regression Analysis
- Allows us to look at causation using two interval/ratio variables.
- Involves predicting the value of the dependent variable from the independent variable.
- Other control variables can be added to the regression analysis.

The calculation for regression is based on:
- The concept of the regression line.
- Which points in the association between the two variables fall on or off the regression line.

For simple (two-variable) regression: y = a + bx, where a is the y-intercept and b is the slope of the line. The slope is the amount y increases for each one-unit increase in x, and x is the independent-variable value used to predict y (the dependent-variable value).

[Scatterplot omitted: regression line for the association between two variables, with current salary plotted on one axis.]

Control variables are those variables that, when combined with the independent variable, may affect the value of the dependent variable. For example, when we look at the association between beginning salary and current salary, both age and gender may affect salary amounts.

Regression SPSS print-out

Model Summary

  Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
  1       .881a   .776       .775                $8,096.337
  a. Predictors: (Constant), Minority Classification, Beginning Salary

ANOVA(b)

  Model 1      Sum of Squares   df    Mean Square   F         Sig.
  Regression   1.1E+11            2   5.352E+10     816.484   .000a
  Residual     3.1E+10          471   65550668.16
  Total        1.4E+11          473
  a. Predictors: (Constant), Minority Classification, Beginning Salary
  b. Dependent Variable: Current Salary

Coefficients(a)

  Model 1                    B           Std. Error   Beta    t        Sig.
  (Constant)                 2516.971    945.359              2.662    .008
  Beginning Salary           1.896       .048         .874    39.583   .000
  Minority Classification    -1632.896   909.959      -.040   -1.794   .073
  a. Dependent Variable: Current Salary
  (B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.)

Let's check what this means about minority classification and salary.

Report: Current Salary by Minority Classification

  Minority Classification   Mean       N     Std. Deviation
  No                        $36023.3   370   *********
  Yes                       $28713.9   104   *********
  Total                     $34419.6   474   *********

Hypothesis testing
- Confidence level: .05
- Alternative hypothesis: Controlling for minority status, beginning salary is associated with (can predict) current salary.
- Null hypothesis: Controlling for minority status, beginning salary is not associated with (cannot predict) current salary.

Analyzing regression
Three values can be used to interpret the results:
(1) R² – the correlation between the independent and control variables and the dependent variable.
(2) F – the goodness of fit of the regression line, calculated based on the number of points off the line.
(3) b – a measure of the correlation between one variable in the regression model and the dependent variable; this is used when you include multiple independent or control variables in the model.

Hypothesis test (continued)
- Total correlation between the independent and control variables and the dependent variable: R² = .776 (note there is no p value, but the closer R² is to 1.00 the better). This means there is a high correlation between minority classification and beginning salary combined and current salary.
- Total fit of the model to the regression line: F = 816, p = .00 (less than our confidence level of .05). The alternative hypothesis is confirmed.
- Individual Beta values: beginning salary (.874 at p = .00) and minority status (-.040 at p = .073). At the .05 confidence level, only beginning salary is statistically significant, that is, associated with current salary.
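For readers who want to reproduce the print-out above outside SPSS, here is a hedged sketch using Python's pandas and statsmodels (not part of the handout). The file name employee_data.csv and the column names current_salary, beginning_salary, and minority are assumptions; with the actual employee data, the fitted model should return the constant, the b coefficients, R², and F reported above.

```python
# Minimal multiple-regression sketch of y = a + bx (plus a control variable),
# assuming columns 'current_salary', 'beginning_salary', and
# 'minority' (0 = No, 1 = Yes). File and column names are illustrative.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("employee_data.csv")   # hypothetical file name

X = sm.add_constant(df[["beginning_salary", "minority"]])   # adds the intercept a
y = df["current_salary"]

model = sm.OLS(y, X).fit()

print(model.rsquared)    # R Square (.776 in the print-out)
print(model.fvalue)      # F for the overall model (816.484 in the print-out)
print(model.params)      # a (constant) and the unstandardized b coefficients
print(model.pvalues)     # Sig. for each coefficient
```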
Review of statistical tests

  Statistical Test   Test statistic
  Chi-square         Chi-square (χ²)
  T-test             t
  ANOVA              F
  Correlation        r
  Regression         R² or F for the fit of the model, and b for the correlation between each of the independent and control variables and the dependent variable

General rules for analyzing results
- The bigger the test statistic, the more likely there is a relationship between the independent and dependent variables.
- For every type of inferential statistic other than correlation, test-statistic values greater than 3 are usually statistically significant.
- Relationships can be positive or negative.
- You need the p value to determine whether the test statistic is actually large enough to be statistically significant.
- You must always set a confidence level before deciding whether the p value indicates statistical significance.
- Findings from small samples are unlikely to be significant unless there is a very strong relationship between the two variables.

How do we write up test results?
- We use the test statistic and the probability level.
- The correct procedure for professional journal articles also requires reporting the degrees of freedom and the number of observations.
- For Assignment #4, use the test statistic and the probability level.

Proper format for this class
- The confidence level is p = .05. Reject the null hypothesis and accept the alternative hypothesis. Correlation is r = .74 at p = .04.
- The confidence level is p = .10. Accept the null hypothesis and reject the alternative hypothesis. There is no association between years of education and salary, controlling for gender; F = .45, p = .70.

Criteria for using statistical tests
- Independent samples
- Level of measurement
- Normal distribution
- Sample size (the minimum for quantitative research should be 30)
- Robustness (can the procedure be used when basic assumptions are violated?). The t-test, ANOVA, and chi-square are considered very robust.

Research note: Some types of ordinal data can be used as interval/ratio data in statistical analysis. Montcalm and Royse state that such data should be ranked at at least five levels, come from a normal distribution, and result from a large sample. The most common type of ordinal data used as interval/ratio data in statistics is a Likert scale.

Example of a Likert scale: 1 = Very satisfied, 2 = Satisfied, 3 = Neutral, 4 = Unsatisfied, 5 = Very unsatisfied. Because it is usually presented as a ranking (1 to 5), it implies an equal distance among the categories.

If you do not have a random sample, it is proper to use nonparametric statistics when:
- The sample size is small.
- There is no normal distribution or random sampling.
- There is more than one mode.
- There are many outliers in the data set.
- The dependent variables are ordinal or dichotomous.

SPSS instructions for running ANOVA
- Select Compare Means.
- Select One-Way ANOVA.
- Highlight your dependent variable (must be ratio) and click on the arrow.
- Highlight your factor (independent) variable (must be nominal with at least three categories).
- Click OK.

SPSS instructions for running Regression
- Select Analyze.
- Select Regression, then Linear.
- Highlight the dependent variable (must be ratio).
- Highlight two or more independent or control variables and click on the arrow.
- Click OK.

SPSS instructions for running Means
- Select Analyze.
- Select Compare Means, then Means.
- Highlight the dependent (ratio) variable.
- Highlight the independent (nominal) variable.
- Click OK.
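Finally, as a rough Python analogue of the Compare Means procedure just described (a sketch, not part of the handout), the example below tabulates the mean, N, and standard deviation of a ratio variable within each category of a nominal variable; the file and column names are again hypothetical.

```python
# Rough equivalent of SPSS Analyze > Compare Means > Means:
# mean, N, and standard deviation of 'current_salary' within each
# category of 'minority'. Names are illustrative.
import pandas as pd

df = pd.read_csv("employee_data.csv")   # hypothetical file name

report = df.groupby("minority")["current_salary"].agg(["mean", "count", "std"])
print(report)
```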