Comprehensive Exam Review

Research and Program Evaluation
Part 5

Analyses of Differences

Recall that for purposes here, an analysis of difference involves at least one continuous variable and at least one discrete variable. In this context, the variable that is continuous is sometimes called the “dependent” variable, and the variable that is discrete is sometimes called the “independent” variable. The purpose of the analysis is to investigate differences in the continuous variable as a function of the categories in the discrete variable.

Think about this possibility for a while. Imagine that the same test was given to a group of people on many occasions, but on each occasion the test was administered, they had not taken the test previously. Then, imagine that the mean for the test was computed for each occasion it was administered. If a graph were made of the various means for the group against the frequency of occurrence of the respective means, the result would be a normal distribution of the means (because the various factors affecting test performance would come together in different ways on different occasions). This very special distribution is known as the theoretical distribution of sampling means.

[Figure: normal curve; vertical axis f (frequency), horizontal axis values of the means]

This distribution represents 100% of the possible means that the group might achieve on any occasion. In other words, there is a 100% probability that the mean would fall under this curve.

Because this is a “normal distribution,” all of its mathematical properties are known. For example, it is symmetric about the mean and the standard area percentages under the curve are known.

[Figure: normal curve divided into six areas of 2%, 14%, 34%, 34%, 14%, and 2%]

This theoretical distribution can be grounded in reality if one assumption is accepted: that an observed mean (i.e., one from an actually administered test or measurement) is the mean of the theoretical distribution of sampling means.
Generally, this assumption is presumed valid unless there is specific information that the assessment situation was something other than “normal.”

Once the test is given, the areas under the curve can be related to any mean if the distance between the observed mean and the points that divide the areas is known. These distances are known as the “Standard Error of the Mean.” A “standard error” is a standard deviation of a theoretical distribution.

[Figure: normal curve centered on the Observed Mean, with the horizontal axis marked at -3, -2, -1, +1, +2, and +3 standard errors of the mean and areas of 2%, 14%, 34%, 34%, 14%, and 2%]

There is approximately a two-thirds chance (68%) that the mean for the group will fall between +/- one (1) standard error of the mean on any occasion.

Similarly, the probability, or likelihood, of the mean falling between +/- two (2) standard errors of the mean on any occasion is approximately 95% (96% when the rounded areas shown on the curve are summed), and so on.

Two of the more useful statements that can be made are:
  There is a 95% probability that the mean will fall between +/- 1.96 standard errors of the mean on any occasion.
  There is a 99% probability that the mean will fall between +/- 2.58 standard errors of the mean on any occasion.

These “confidence limits” look like this on the theoretical distribution of sampling means.

[Figure: normal curve with 95% of the area between +/- 1.96 standard errors of the mean and 99% of the area between +/- 2.58 standard errors of the mean]

Now assume a situation in which the same thing is measured (i.e., using the same test or measure) on two different occasions for the same group (as in pre-post testing). If the group did not have exactly the same mean for each testing occasion, there was a difference between the means. That difference happened either because something caused the difference or by chance.
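The standard error of the mean and the confidence limits described above can be illustrated with a short calculation. This is only a sketch: the slides do not give a formula, so it assumes the usual estimate of the standard error (the sample standard deviation divided by the square root of the number of scores), and the function name `confidence_limits` and the scores in `pre_scores` are hypothetical.

```python
import math
import statistics

def confidence_limits(scores, z=1.96):
    """Return (mean, standard error, lower limit, upper limit).

    Assumes the usual estimate of the standard error of the mean:
    sample standard deviation / sqrt(number of scores).
    """
    n = len(scores)
    mean = statistics.mean(scores)
    se = statistics.stdev(scores) / math.sqrt(n)
    return mean, se, mean - z * se, mean + z * se

# Hypothetical pretest scores, for illustration only.
pre_scores = [48, 52, 50, 47, 53, 51, 49, 50]

mean, se, lower, upper = confidence_limits(pre_scores)           # 95% limits
_, _, lower99, upper99 = confidence_limits(pre_scores, z=2.58)   # 99% limits

print(f"observed mean = {mean:.2f}, standard error = {se:.2f}")
print(f"95% confidence limits: {lower:.2f} to {upper:.2f}")
print(f"99% confidence limits: {lower99:.2f} to {upper99:.2f}")
```

The 99% limits are always wider than the 95% limits, because a larger multiple of the same standard error is added to and subtracted from the observed mean.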
The important question is, “What is the likelihood (i.e., probability) that the difference happened simply by chance?” Graphically (at the .05 level), it looks like this:

[Figure: normal curve centered on the PRE mean, with 95% of the area between +/- 1.96 standard errors of the mean]

The question is, “Is the post mean inside or outside of the 95% confidence limits for the pre mean?”

What a statistically significant difference looks like graphically...

[Figure: a post mean that falls inside the confidence limits at +/- 1.96 standard errors of the PRE mean is non-significant; a post mean that falls outside either limit is significant]

The t-test is a statistical significance test that covers this situation. The t-test is used to determine if there is a statistically significant difference between only two means. The t-test is particularly appropriate when data from a small number of subjects (e.g., 30 or fewer) are being analyzed, although it can be used with larger samples as well. The t-test is sometimes referred to as the “Student’s t-test.”

There are two types of t-tests.
  A dependent, or correlated, t-test is used when the difference between the means of the same group assessed on two occasions is being evaluated (e.g., pre-post).
  An independent, or uncorrelated, t-test is used when the difference between the means of two separate groups is being evaluated (e.g., males and females).

A t-test yields a statistic called a t value. Computer programs generating the t value also present the (exact) probability of obtaining a t value of that magnitude. The (exact) probability calculated for the t value is compared to the (pre-determined) alpha level for the analysis.

For the t-test, it was noted that the discrete variable (i.e., the one that has categories) is sometimes called the “independent” variable. The discrete variable is also known as a “factor.” It is important to remember that this is a different and distinct use from “factor analysis,” which was a type of analysis of relationships. In the context of analyses of differences, a factor is a variable that is discrete (i.e., has categories) and is sometimes called the independent variable.
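The two types of t-tests described above can be sketched as follows. This is an illustration rather than a definitive implementation: the function names and all of the scores are hypothetical, the independent version assumes the pooled-variance form of the test, and only the t value and its degrees of freedom are computed. The exact probability for the t value would normally come from a statistics package.

```python
import math
import statistics

def dependent_t(first, second):
    """t value for the same group measured on two occasions (e.g., pre-post)."""
    diffs = [b - a for a, b in zip(first, second)]
    n = len(diffs)
    # Standard error of the mean difference.
    se_diff = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / se_diff, n - 1   # (t value, degrees of freedom)

def independent_t(group1, group2):
    """t value for two separate groups (pooled-variance form, an assumption)."""
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * statistics.variance(group1)
                  + (n2 - 1) * statistics.variance(group2)) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (statistics.mean(group1) - statistics.mean(group2)) / se_diff
    return t, n1 + n2 - 2                            # (t value, degrees of freedom)

# Hypothetical scores, for illustration only.
pre = [10, 12, 9, 11, 13]
post = [12, 15, 10, 14, 14]
t_dep, df_dep = dependent_t(pre, post)

group_a = [4, 5, 6]
group_b = [1, 2, 3]
t_ind, df_ind = independent_t(group_a, group_b)

print(f"dependent t = {t_dep:.3f} with {df_dep} degrees of freedom")
print(f"independent t = {t_ind:.3f} with {df_ind} degrees of freedom")
```

The dependent version works on the differences within each pair of scores, which is why both lists must contain the same people in the same order.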
In the context of analyses of differences, the categories of a factor are called “levels.” In some ways, this is again a poor choice of words because “levels” implies some type of hierarchy, but that’s not really what it means in this context.

Suppose “gender” as a (discrete or independent) variable is included in a study. In the study, gender would be a “factor” having two “levels” (i.e., male and female). Remember that levels = categories; no hierarchy is necessarily applicable.

A t-test would be the appropriate analysis for a study having only one factor that has two levels. The levels (categories) of the factor may be “uncorrelated” (e.g., gender) or “correlated” (e.g., pre-post). Instead of “correlated,” the phrase “repeated measures” is used to indicate that the levels of a factor are actually two or more measurements on the same group of people as part of a single research study.

Suppose that instead of viewing gender as either male or female, it was considered “sex-role orientation.” The possible categories might then be male, female, and androgynous, which would be the three levels of the “sex-role orientation” factor. Then suppose a measure of “counseling effectiveness” could be obtained for everyone in each of the three groups. One question might then be, “Are there statistically significant differences among the counseling effectiveness means of the three groups?” Graphically, the possibilities would be:

[Figure: possible patterns of the three group means, labeled A (androgynous), F (female), and M (male)]

The appropriate analysis for this situation is a one-way analysis of variance. It’s called “one-way” because there is only one factor involved. This is one of several types of analyses of variance, all of which are abbreviated “ANOVA.”

A one-way ANOVA is appropriate when there is one factor in the study. The factor may have three or more levels. The levels may be either uncorrelated (e.g., three categories of sex-role orientation) or correlated (e.g., pre-post-follow-up for an experimental study).
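For a factor with uncorrelated levels, the one-way analysis of variance described above reduces to a ratio of two variance estimates. The sketch below shows that arithmetic under stated assumptions: the function name `one_way_anova` and the three sets of scores are hypothetical, and the calculation covers only uncorrelated groups (a repeated measures design needs a different error term).

```python
import statistics

def one_way_anova(*groups):
    """Return (F value, df between, df within) for uncorrelated groups."""
    k = len(groups)                                  # number of levels
    n_total = sum(len(g) for g in groups)            # total number of scores
    all_scores = [x for g in groups for x in g]
    grand_mean = statistics.mean(all_scores)

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    # Variation of the individual scores around their own group mean.
    ss_within = sum((x - statistics.mean(g)) ** 2
                    for g in groups for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within

# Hypothetical scores for three levels of one factor, for illustration only.
level_1 = [1, 2, 3]
level_2 = [2, 3, 4]
level_3 = [6, 7, 8]

f_value, df_between, df_within = one_way_anova(level_1, level_2, level_3)
print(f"F = {f_value:.2f} with {df_between} and {df_within} degrees of freedom")
```

A large F value arises when the group means differ from one another by much more than the scores differ within each group.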
A one-way ANOVA yields an F statistic (or as it is more commonly known, an F value). Theoretically, a one-way ANOVA works with a factor with as many levels as are relevant and/or desired. Computer programs generate an exact probability for the F value, which can then be compared to the alpha level.

A statistically significant F value means that there is at least one statistically significant difference among the means. However, a statistically significant F value does not indicate which means are significantly different from one another.

A “multiple comparison” is a statistical procedure that allows determination of which means are statistically significantly different from one another. A multiple comparison is only appropriate following a statistically significant F value. A multiple comparison allows determination of which of these patterns exists (and more than one may apply):

[Figure: the possible patterns of the A, F, and M group means]

Multiple comparison procedures range on a continuum of “liberal” to “conservative.” The more liberal the procedure, the smaller the difference needed to be considered statistically significantly different. More conservative procedures reduce the chance for Type I error, but make it more difficult to achieve a statistically significant difference.

Some of the multiple comparison methods, ordered from liberal to conservative, include:
  Pairwise Comparisons (t-tests) (most liberal)
  Fisher LSD
  Duncan Multiple Range Test
  (Student) Newman-Keuls
  Tukey HSD
  Scheffe (most conservative)

A factorial analysis of variance (ANOVA) is appropriate when there are two or more factors, each of which has at least two levels. (Again, remember that a factorial ANOVA is not the same as factor analysis.)

Suppose the research question was, “What are the differences in graduate-level academic aptitude as a function of gender and race?” The variables might be as follows: The “dependent” variable is GRE Total Score. One factor is “gender,” and it has two levels: male (M) and female (F).
Another factor is “race,” and it has three levels: African-American (AA), Hispanic-American (HA), and Caucasian-American (CA). Graphically, the research could be shown as:

[Figure: 2 x 3 grid of GRE-T means; rows are Gender (M, F), columns are Race (AA, HA, CA)]

One F value would be obtained for each factor:
  F(gender)
  F(race)
These are known as the “main effects” F values. These F values are independent; the statistical significance of one is unrelated to the statistical significance of the other.

An interaction F value also would be obtained:
  F(gender by race)
An interaction F value allows evaluation of whether the effects of one variable are consistent for all levels of the other variable. The interaction F value is independent of the other two. Graphically, it all looks like this:

[Figure: the same 2 x 3 grid, with F(gender) for the rows, F(race) for the columns, and F(gender by race) for the cells]

Now suppose another factor is added, such as “academic degree” (Master’s, Specialist, or Doctorate).

[Figure: 2 x 3 x 3 grid; Gender (M, F) by Race (AA, HA, CA) by Degree (M, S, D)]

There will be one F value for each factor (aka “main effects”):
  F(gender)
  F(race)
  F(degree)
These F values are all independent of one another. If F(race), F(degree), or both are statistically significant, a multiple comparison would be needed to determine the pattern of significant differences.

There also would be three “two-way interactions”:
  F(gender by race)
  F(race by degree)
  F(degree by gender)

There also would be one “three-way interaction,” which represents the combination of variables three at a time:
  F(gender by race by degree)
These F values also are independent of all the others.

The t-test, one-way ANOVA, and factorial ANOVA are known as “univariate” analyses, because only one dependent variable (e.g., GRE Total score) is involved. If a second (or more) dependent variable is added, the appropriate analysis is a multivariate analysis of variance (MANOVA). A MANOVA also yields an F value. If the F(multivariate) is NOT significant, it means that there are no significant differences anywhere among the sets of means.
If the F(multivariate) is statistically significant, appropriate univariate analyses must be computed to determine which means are significantly different from one another.

Graphically, analyses of differences can be summarized as follows:

Analysis of Difference   Dependent Variables   Factors   Levels   Uncorrelated   Repeated Measures
(Student’s) t-test       1                     1         2        Yes            Yes
One-way ANOVA            1                     1         3        Yes            Yes
Factorial ANOVA          1                     2         2        Yes            Yes
MANOVA                   2                     2         2        Yes            Yes

Nonparametric Statistics

So-called nonparametric statistics are used when the data are nominal or ordinal, or when the data are interval but the assumption of a normal distribution of the variable cannot be met. In general, there are nonparametric statistical analyses that “parallel” most parametric statistical analyses.

The following are commonly used nonparametric “correlational” techniques, most derived from the Pearson Product-Moment Correlation Coefficient.
  Spearman’s Rho is a correlation coefficient appropriate when the data being correlated are ranks (i.e., ordinal data).
  A Point Biserial Correlation is appropriate when one of the variables is continuous and the other is dichotomous.
  A Biserial Correlation is appropriate when both variables are actually continuous, but one is being treated as a dichotomous variable.
  A Tetrachoric Correlation is appropriate when both variables are actually continuous, but both are being treated as dichotomous.
  A Phi Coefficient is appropriate when both variables are actually dichotomous.
  A Coefficient of Contingency is appropriate when one or both of the (nominal) variables has three or more categories.

The following are commonly used nonparametric tests of differences.
  The Median Test is appropriate to use to test the significance of difference between the medians of two independent samples.
  A Sign Test is appropriate to test the significance of difference between two sets of paired observations (i.e., measurements).
  The Wilcoxon Rank Sum Test is appropriate to test the significance of difference when the data from two independent samples can be assigned ranks.
  The Mann-Whitney U Test is essentially the same as the Wilcoxon Rank Sum Test, but is often used with smaller samples.
  The Kruskal-Wallis is essentially a one-way analysis of variance appropriate to use to test the significance of difference among three or more sets of ranks.

The Chi Square Test, which is the most commonly used nonparametric statistic, is a test of the magnitude of discrepancy between observed (i.e., measured) and expected distribution frequencies. The Chi Square Test is used either as a “goodness of fit” test or as a test of independence.
  The Chi Square “goodness of fit” Test is usually used to test the degree of agreement between observed and (theoretically) expected frequencies for a single variable.
  The Chi Square Test as a test of independence is used to test whether two variables are independent, by comparing the observed frequencies with the frequencies expected if the variables were unrelated.

Because the distributions to which the various nonparametric statistics are applied vary considerably, methods to evaluate the statistical significance of the various statistics generated are unique to the various techniques. However, similar to most parametric statistics, the resultant nonparametric statistical value is evaluated against its probability as a chance occurrence.

Needs Assessment and Program Evaluation

A fundamental question in the counseling professions is: How can we integrate good needs assessment and program evaluation practices to yield an effective and comprehensive understanding of a service delivery system?

One commonly accepted approach is to follow the CIPP model, in which CIPP is an acronym for:
  Context evaluation
  Input evaluation
  Process evaluation
  Product evaluation

Context evaluation:
  is essentially equal to needs assessment within the CIPP model.
  necessitates clear specification of potential service recipients.
  involves gathering data directly from potential service recipients.
  should point to program goals and objectives.

Primary context evaluation methods include use of surveys and/or interviews.

Effective context evaluation provides answers to questions such as:
  What is the diversity of the needs expressed among the potential service recipients?
  What are the priorities among the various categories of needs expressed?
  Do the needs expressed reflect current or future circumstances?
  Which of the expressed needs are in concert with program activities?

Input evaluation:
  serves to identify available resources and constraints for a service delivery program.
  follows directly from a context evaluation (i.e., needs assessment).
  yields the parameters within which the program can and should be conducted.

Effective input evaluation will provide answers to questions such as:
  What is the environment (i.e., physical space and material resources) available for the program?
  What are the fiscal resources available for the program?
  What are the human or personnel resources available for the program?
  What rules (and/or entities) govern the conduct of the program?

Together, the results of context and input evaluations determine the nature of the accountability for the program and for the program participants. That is, they allow determination of who will be accountable to whom, how, and for what.

Process evaluation:
  is concerned with the effectiveness of the day-to-day operation of the program.
  is used interchangeably with the term “formative” evaluation.
  provides data upon which to base service delivery program decisions while the program is in operation.

Effective process evaluation will provide answers to questions such as:
  What is the efficiency level of the service delivery program?
  What factors influence the expenditure of funds within the service delivery program?
  How efficient is the service delivery program schedule?
  What factors influence decision-making processes in the service delivery program?

Product evaluation:
  allows determination of the actual “outcomes” of the service delivery program.
  is used interchangeably with the term “summative” evaluation.
  is often considered the “bottom line” in accountability processes.

Effective product evaluation provides answers to questions such as:
  To what extent are the service delivery program’s goals and objectives being met?
  What are the service delivery program’s impacts in terms of identifiable changes?
  What is the service delivery program’s cost-benefit ratio?
  What are the “lost opportunity” costs attributable to intra-program changes?

The CIPP model is circular... The best evaluation evolves from a fully integrated cycle of all four parts of the model.

[Figure: cycle linking Context, Input, Process, and Product evaluation]

The CIPP model and accountability are integrally linked because any service delivery program should be held accountable for:
  what it is attempting to accomplish (context evaluation),
  what resources it uses (input evaluation),
  how resources are used (process evaluation), and
  what happens as a result of the program (product evaluation).

Seven types of accountability have been identified in the professional counseling and development literature.
  Service delivery accountability addresses the question, “To what extent does the program deliver the services it promises to deliver?”
  Ethical accountability addresses the question, “To what extent are services delivered within the parameters of acceptable ethical practice?”
  Legal accountability addresses the question, “To what extent are services delivered within the parameters of legal constraints?”
  Coverage accountability addresses the question, “To what extent does the service delivery program serve all of the people it purports to serve?”
  Efficiency accountability addresses the question, “To what extent is time used efficiently in delivery of the service program?”
  Fiscal accountability addresses the question, “To what extent are available fiscal resources used in a manner that maximizes the likelihood of positive program outcomes?”
  Impact accountability addresses the question, “To what extent does the service delivery program actually make positive changes in people’s lives?”

This concludes Part 5 of the presentation on RESEARCH AND PROGRAM DEVELOPMENT