* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download CNM Handbook for Educational Research (doc)
Survey
Document related concepts
Transcript
CNM HANDBOOK FOR EDUCATIONAL RESEARCH Central New Mexico Community College Ursula Waln, Director of Student Learning Assessment, and Fred Ream, Mathematics Instructor 2014 Table of Contents Purpose of this Handbook ...................................................................................................................................... 1 IRB and Educational Research .............................................................................................................................. 1 How the IRB Came to Be ................................................................................................................................... 1 The Role of the IRB ........................................................................................................................................... 2 Research Involving Human Subjects ................................................................................................................. 2 IRB Exemptions ................................................................................................................................................. 3 IRB Criteria ........................................................................................................................................................ 4 Informed Consent ............................................................................................................................................... 5 Research Design..................................................................................................................................................... 6 Quantitative and Qualitative Options ................................................................................................................. 6 Sampling Methods.............................................................................................................................................. 9 Statistics ............................................................................................................................................................... 10 Data Types........................................................................................................................................................ 10 Analytic Approaches ........................................................................................................................................ 10 Figure 1: Statistical Test Decision-Making Flowchart ................................................................................. 11 Refresher of Basic Concepts ............................................................................................................................ 12 Figure 2: Normal Distribution ...................................................................................................................... 14 Figure 3: Illustration of Outliers ................................................................................................................... 15 Analytic Software Programs ............................................................................................................................ 16 Preliminary Data Analysis Tips ....................................................................................................................... 16 Glossary ............................................................................................................................................................... 17 References ............................................................................................................................................................ 19 Appendix A: Recommendations from CNM’s IRB............................................................................................. 20 Appendix B: Preview of CNM IRB Application Form ....................................................................................... 21 PURPOSE OF THIS HANDBOOK This handbook was developed to foster depth of intentionality and critical analysis in CNM student outcomes assessment. It builds on the information presented in the CNM Handbook for Outcomes Assessment and is based on the premises that 1) student outcomes assessment is a form of educational research, and 2) considering one’s assessment efforts as research may suggest more creative, purposeful, and meaningful approaches. The following pages offer information and tips for those who wish to develop a greater degree of sophistication in their assessment approach and/or have their assessment efforts render more useful information. IRB AND EDUCATIONAL RESEARCH If you are interested in conducting research that is not purely operational in nature, is experimental, is for a dissertation or master’s thesis, or is going to be used outside of CNM in any way, your project requires prior approval by the CNM Institutional Review Board (IRB). If you think there is any possibility at all that you will ever want to publish your findings, use the data for other publishable research, or do anything else with your findings outside of CNM, please get IRB approval before beginning any data collection. The question of whether your particular study must have IRB approval hinges on whether or not it meets the definition of ‘research involving human subjects.’ However, for any study that is not operational in nature and going to be used only within CNM, the IRB, not the researcher, must make this determination, based on an application. If you have any doubt at all, submit an application. The process itself can help you organize your research plans, so the time won’t be wasted, and you’ll know you covered your bases. A requirement for IRB approval need not deter faculty from pursuing educational research projects. Indeed, structured research into student learning not only has great potential for informing internal improvement efforts, but also can produce insights with the potential to impact the broader profession and/or instructional community. The application form for submitting a project for IRB approval is shown in Appendix B and can be accessed at http://www.cnm.edu/depts/planning/instres/irb. If you are absolutely certain your project does not require IRB approval and you already know all you need to know about the IRB, you may want to skip ahead to the research design section, page 11. How the IRB Came to Be The existence of IRBs for certification of research conducted at colleges and university is required under The National Research Act of 1974. Although certainly not the first example of a societal effort to regulate experimentation involving humans, this federal statute came about in response to medical research that violated ethical principles deemed fundamental to American values and basic human rights. Salient examples include the medical experiments conducted in Nazi concentration camps between 1939 and 1945, (which led to the establishment of the first international code of research ethics, the Nuremburg Code) and the U.S. Public Health Service’s Tuskegee Syphilis Study of 1932-1972, in which 400 African American men known to have syphilis had information and treatment withheld so that researchers could observe the progression of the disease (Schneider). In 1974, Congress established the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research and tasked it with identifying ethical principles and developing guidelines for the conduct of research involving human subjects. In 1979, the Commission published the Belmont Report, which summarized three basic principles: 1. Respect for persons o Individuals should be treated as autonomous agents 1 o Persons with diminished autonomy are entitled to additional protections 2. Beneficence o Do no harm o Maximize possible benefits and minimize possible harms 3. Justice o Requires that individuals and groups be treated fairly and equitably in terms of bearing the burdens and receiving the benefits of research -- NIH, p. 21 The Belmont Report principles and guidelines were codified in the Title 45 Code of Federal Regulations, Part 46, and oversight was given to the U.S. Department of Health, Education and Welfare (which later split into the Department of Health and Human Services and the Department of Education). In 1991, The U.S. Department of Education became one of fifteen federal departments and agencies to codify in separate regulations the “Common Rule,” including in its 34 CFR Part 97 language identical to 45 CFR 46, Subpart A (NIH). Today, seventeen federal agencies fall under the Common Rule, and oversight for all is provided by the Office for Human Research Protections, a division of the U.S. Department of Health and Human Services. The Role of the IRB At the institutional level, “The IRB is an administrative body established to protect the rights and welfare of human research subjects recruited to participate in research activities conducted under the auspices of the institution with which it is affiliated” (IRB Guidebook, p. 1). IRBs determine “the acceptability of proposed research in terms of institutional commitments and regulations, applicable law, and standards of professional conduct and practice” (45 CFR 46.107). The major roles of IRBs in the oversight of research are: 1. Evaluating the respect for persons, beneficence, and justice of all research activities proposed under the auspices of the institution and providing approval or disapproval accordingly. 2. Ensuring that the process proposed to collect informed consent meets regulatory requirements. 3. Overseeing the progress and protocols of ongoing research studies. -- NIH, p. 84 Research Involving Human Subjects The CFR provides the following definitions: …(d) Research means a systematic investigation, including research development, testing and evaluation, designed to develop or contribute to generalizable knowledge. Activities which meet this definition constitute research for purposes of this policy, whether or not they are conducted or supported under a program which is considered research for other purposes. For example, some demonstration and service programs may include research activities. (e) Research subject to regulation, and similar terms are intended to encompass those research activities for which a federal department or agency has specific responsibility for regulating as a research activity, (for example, Investigational New Drug requirements administered by the Food and Drug 2 Administration). It does not include research activities which are incidentally regulated by a federal department or agency solely as part of the department's or agency's broader responsibility to regulate certain types of activities whether research or non-research in nature (for example, Wage and Hour requirements administered by the Department of Labor). (f) Human subject means a living individual about whom an investigator (whether professional or student) conducting research obtains (1) Data through intervention or interaction with the individual, or (2) Identifiable private information. Intervention includes both physical procedures by which data are gathered (for example, venipuncture) and manipulations of the subject or the subject's environment that are performed for research purposes. Interaction includes communication or interpersonal contact between investigator and subject. Private information includes information about behavior that occurs in a context in which an individual can reasonably expect that no observation or recording is taking place, and information which has been provided for specific purposes by an individual and which the individual can reasonably expect will not be made public (for example, a medical record). Private information must be individually identifiable (i.e., the identity of the subject is or may readily be ascertained by the investigator or associated with the information) in order for obtaining the information to constitute research involving human subjects… -- 45 CFR 46.102 IRB Exemptions The CNM IRB, with endorsement from the VPAA, Deans’ Council, Chairs’ Council, and Faculty Senate, documented its Recommendations for Research in 2012, stating that “Some types of research are exempt from IRB review: student projects for class, data collection (e.g., surveys) by faculty or staff, assessment data collection, student course evaluations, and research by the college that is not designed for publication” (Appendix A). Whether or not an investigation is subject to IRB approval, the CNM IRB (on its homepage) recommends following the basic rules listed in its document Recommendations for Research. In the same vein, it is advisable for anyone planning educational research at CNM to be familiar with and follow the IRB criteria and informed consent guidelines in the following sections of this handbook, whether or not the research requires IRB approval. Forming the basis for the CNM statements of exemption are the following excerpts from 45 CFR Part 46, and the corresponding educational regulations in 34 CFR Part 97 (formatting enhanced for readability). …(b) Unless otherwise required by department or agency heads, research activities in which the only involvement of human subjects will be in one or more of the following categories are exempt from this policy: (1) Research conducted in established or commonly accepted educational settings, involving normal educational practices, such as (i) research on regular and special education instructional strategies, or (ii) research on the effectiveness of or the comparison among instructional techniques, curricula, or classroom management methods. 3 (2) Research involving the use of educational tests (cognitive, diagnostic, aptitude, achievement), survey procedures, interview procedures or observation of public behavior, unless: (i) information obtained is recorded in such a manner that human subjects can be identified, directly or through identifiers linked to the subjects; and (ii) any disclosure of the human subjects' responses outside the research could reasonably place the subjects at risk of criminal or civil liability or be damaging to the subjects' financial standing, employability, or reputation. (3) Research involving the use of educational tests (cognitive, diagnostic, aptitude, achievement), survey procedures, interview procedures, or observation of public behavior that is not exempt under paragraph (b)(2) of this section, if: (i) the human subjects are elected or appointed public officials or candidates for public office; or (ii) federal statute(s) require(s) without exception that the confidentiality of the personally identifiable information will be maintained throughout the research and thereafter. (4) Research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects… For the CNM IRB Recommendations for Research, see Appendix A or go to https://www.cnm.edu/depts/planning/instres/irb/documents/RECOMMENDATIONS-FOR-RESEARCH.pdf. IRB Criteria The following CFR excerpts hit the highlights of the federal criteria for IRB approval: 1. Risks to human subjects are minimized… 2. Risks to human subjects are reasonable in relation to anticipated benefits, if any, to human subjects and the importance of the knowledge that may reasonably be expected to result… 3. Selection of human subjects is equitable… 4. Informed consent will be sought from each prospective research participant or the prospective research participant’s legally authorized representative in accordance with and to the extent required by §46.116. 5. Informed consent will be appropriately documented in accordance with and to the extent required by §46.117. 6. When appropriate, the research plan makes adequate provision for monitoring the data collected to ensure the safety of subjects 7. When appropriate, there are adequate provisions to protect the privacy of subjects and to maintain the confidentiality of data. -- 45 CFR 46.111 The federal regulations can be accessed in their entirety at http://www.hhs.gov/ohrp/humansubjects/guidance/45cfr46.html or at http://www2.ed.gov/policy/fund/reg/humansub/part97.html. In addition, an Institutional Review Board Guidebook, published by the HHS, is available at http://www.hhs.gov/ohrp/archive/irb/irb_guidebook.htm. 4 And, the CNM IRB web site and links to documents can be found at http://www.cnm.edu/depts/planning/instres/irb. Informed Consent The principle of respect for persons requires that investigators conducting research on human subjects obtain informed consent in the form of “[a] legally-effective, voluntary agreement that is given by a prospective research participant following comprehension and consideration of all relevant information pertinent to the decision to participate in a study” (NIH, p. 117). Autonomous adults can provide their own informed consent. (The age of majority in New Mexico is eighteen.) However, research involving children or other vulnerable populations requires informed consent from not only the subject (whenever the individual is reasonably capable of providing or denying consent), but also the subject’s legally authorized representative. In some cases (e.g., specific cultural settings), community or family consent may also be required. Informed consent consists of voluntariness, comprehension, and disclosure: Voluntariness: Consent must be given free of coercion, and the perceived value of any inducements must be small enough to avoid undue influence on the decision to participate. Care must be taken to ensure that subjects are not manipulated to ignore the risks of participation either through fear of repercussions for refusal or through motivation by excessive or inappropriate rewards. Offering bonus points toward a final course grade would be an example of an inappropriate incentive. However, in some cases, minimal compensation for participants’ time and effort may be appropriate. Comprehension: To provide informed consent, subjects must first understand the information presented to them sufficiently to make a reasoned, informed decision. Subjects must therefore have the cognitive capacity to understand (or else be represented by legally authorized individuals), and the information must be presented using language and methods of delivery appropriate to the people making the decision. Disclosure: Information that must be disclosed for participants to be able to provide informed consent include the purpose, risks, and benefits of the research; alternatives to the research protocol; the extent to which confidentiality can be protected; compensation in case of injury; contact information for questions or concerns; and conditions of participation, including the right to refuse or withdraw participation without penalty. (Compensation may be terminated proportionately upon withdrawal, but any offer of compensation for participation prior to the time of withdrawal must be honored. Telling prospective subjects they must complete the study or forfeit all compensation would be an example of a penalty.) Waivers regarding the disclosure of specific information may be considered by the IRB when the research poses no more than minimal risk to the participants, withholding the information will not adversely affect the rights or welfare of the participants, and the research could not practicably be carried out otherwise. When information is withheld, it should be disclosed following the conclusion of research unless doing so might cause harm. “No informed consent, whether oral or written, may include any exculpatory language through which the subject or the representative is made to waive or appear to waive any of the subject's legal rights, or releases or appears to release the investigator, the sponsor, the institution or its agents from liability for negligence” (45 CFR 46.116). Finally, informed consent should be viewed as an ongoing process, not just a one-time event. Participants should be kept informed and given opportunities to ask questions as the research is carried out. 5 RESEARCH DESIGN For the purposes of CNM student learning assessment, the primary consideration in designing your research should be what you want to know related to the development of your program competencies (student learning outcome statements). Using that as your starting point, you can employ ‘reverse design’ principles to map out the steps you will take toward gaining the insights you seek. Depending on the type of information you want to collect and what you want to do with it, your research design may or may not include the following: A written statement of the issue to be investigated Identification of the population to be studied A description of relevant background information A literature review A written rationale A hypothesis and/or null hypothesis (or just a statement of what you expect to find) Identification of methodology o Sampling methods o Instruments o Procedures for administration o Data analysis methods A timeline Quantitative and Qualitative Options Research approaches are often described as quantitative or qualitative, though the two descriptors more accurately represent ends of a continuum than distinct categories. Both extremes seek to answer questions using systematic procedures, collecting evidence, and examining findings. However, where a research project falls on the continuum is determined by its objective(s), the types of research questions posed, the methodologies employed, and the types of data produced. Expectations for researchers differ as well. Quantitative research tends to have as an objective obtaining generalizable findings. Questions posed tend to focus on associations between variables, methodologies employed tend to involve studying specific variables, and data produced tends to be numerical. Quantitative investigators are typically expected to adhere to scientifically validated research models and techniques and to remain objective. Qualitative research, on the other hand, tends to have as an objective exploring the dynamics of a specific context. Questions posed tend to focus on understanding factors that influence outcomes, methodologies employed tend to be open-ended and exploratory, and data produced tends to be textual (recorded observations, participant comments, etc.). Qualitative investigators are typically considered free to adapt their research models to fit the changing circumstances of their investigations and to render subjective interpretations. Randomized controlled trials represent the quantitative end of the spectrum. In a nutshell, such investigations provide an intervention with a randomly selected experimental group and compare the results to those of a control group (a group that does not receive the intervention). Here are some grouping models commonly used in randomized controlled trials: 6 Parallel grouping: Each student is randomly assigned to an experimental or control group; the experimental group receives the intervention and the control group does not – but may receive a dummy/placebo intervention, to minimize the potential for participant expectations to influence the outcomes. (In educational research, a placebo may be, for example, an instructional method that has already been shown to not produce the effect being researched.) Cluster grouping: Pre-existing groups of students (such as course sections) are randomly selected to receive or not receive the intervention Crossover grouping: Each student receives or does not receive the intervention in a random sequence Factorial grouping: Used when more than one intervention is tested simultaneously to determine whether there is a compound effect; each student is randomly assigned a group that receives a particular combination of interventions (AA, AB, AC, & BC, where A = no intervention, B = intervention 1, and C = Intervention 2) A common crossover approach is to switch the treatment and non-treatment groups at the midpoint of the study. Crossover grouping minimizes the potential for group differences to influence the findings. In educational research, the effectiveness of a cross-over approach depends in part upon whether the intervention being studied is influenced by timing. A Source of Some Confusion The word quantitative may sometimes seem a little slippery because people use it differently depending on whether they are describing data, measurement methods, or research design. Quantitative data is information derived from direct numerical measurement. Height, weight, area, pounds per square inch, counts of students, proportions of responses, and GPA are examples. Representing something as a number when it is not directly, numerically measurable (such as pain rated on a scale of 1 to 10 or level of effort on a scale of 0 to 4) does not make it quantitative data. It is a common error to refer to qualitative data that has been numerically represented as quantitative. Readers will find disagreement in the literature about the classification of measurement methods as quantitative versus qualitative. Applying the same distinction used for classifying data, quantitative measurement methods are those that involve direct, numerical measures. However, common usage lumps together all measurement methods that produce numerical data (especially when conducted in systematic ways) as quantitative measurement methods. As a result, Likert-scale questionnaire items and rubric-based evaluations are commonly referred to as quantitative methods even though the data they comprise is qualitative data. At the research design level, quantitative essentially means the researchers intend to use inferential statistics to make generalizations beyond the study sample. For example, let’s say a foreign language instructor randomly selects half of her students to participate in an emersion program during the first half of the semester only and then has the other half of the students participate in the emersion program during the second half of the semester. The timing of the emersion program could turn out to be a significant factor. Students who develop basic vocabulary and syntactic awareness through classroom instruction during the first eight weeks may have a decoding advantage when they enter the emersion program. On the other hand, students who experience the emersion program during the first eight weeks may develop greater awareness of cadence, pronunciation, and speech patterns, and applying these, advance more effectively through the conventional instruction during the second eight weeks. Anticipating these possibilities, the instructor may decide to include in the design a third group, comprised of students who do not participate in the emersion program at all. A subset of randomized controlled trials, called randomized double-blind placebo-controlled crossover studies, is generally considered the ‘gold standard’ in quantitative scientific research (particularly pharmaceutical research). And, while it need not be the goal of every educational researcher to conduct such a study, an understanding of the rationale supporting the design will help inform design decisions. Double-blind means neither the researchers in contact with the subjects nor the subjects know who is receiving the ‘treatment’ and who is not. This minimizes the potential for either the researchers or the subjects to influence outcomes based on their expectations. In educational research, conducing a double-blind study would require that the educational intervention be provided by someone other than the investigator. 7 For example, let’s say a researcher wants to find out whether students approached outside of class by a peer and asked to explain core course concepts perform significantly better on subsequent exams than students not questioned in this way. To make this a double-blind study, the students must have no inkling that being questioned by a peer is an intervention, and the researcher must remain ignorant regarding which students are questioned until after the data collection is complete. At the other end of the spectrum, qualitative research usually seeks to answer a question related to the perspectives of an involved population. Qualitative research can be particularly useful in studying attitudes, values, opinions, perceptions, beliefs, behaviors, emotions, relationships, social situations, experiences, and other aspects of individual, community, or cultural contexts. Although insights gained may have the potential for extension to similar populations or situations, qualitative research usually focuses more on delving into the factors that influence outcomes within the specific study population than on being able to generalize the findings to other populations. Qualitative research explores and interprets how people experience the complex reality of given issues and/or situations. For example, imagine CNM researchers want to better understand why area employers within a specific sector do not hire as many qualified CNM alumni as might be expected on the basis of job openings. The objective of their investigation will be to identify factors influencing the decisions of the local employers. Their findings might ultimately reveal something about employer perceptions that could be useful, by extension, to other CNM programs or other colleges with similar programs. However, their primary goal is not finding something that can help other programs, but rather, better understanding what their students’ prospective employers want so that the program faculty might better address the preparation of candidates. Research within the field of education is often driven by context-specific questions posed by faculty and therefore tends toward qualitative design. Indeed, large-scale quantitative educational studies are relatively rare, and generalizations made on their basis tend to be subject to dismissal because of the difficulties inherent in applying scientific models to studies of humans that take place in natural settings, especially when the conclusions contradict strongly-held convictions about teaching and learning. (For an illustrative example, consider Project Follow Through, “the largest, most expensive educational experiment ever conducted” [Adams, 1996].) Because investigators conducting qualitative research typically care more about the usefulness of the information gathered in supporting interpretation and insight than about the reproducibility of the study, qualitative research tends to be characterized by adjustability and/or fluidity. The design needs to facilitate collection of the desired information, not stand up to scientific scrutiny. Stated another way, validity of the findings is important in qualitative research, but reliability is not. Observation, interviews, and focus groups are methods commonly used in qualitative research. The data produced may be a mix of direct and indirect measurements and/or textual information with no associated numeric values (such as recorded dialogs, comments, field notes, responses to open-ended questions, etc.). Input using categorization, holistic ratings, Likert scales, rubrics, and/or forced-choice responses may be numerically represented to facilitate statistical analysis. Note that the use of statistics is not the exclusive domain of quantitative research; however, while quantitative research tends to employ inferential statistics, qualitative research tends to incorporate descriptive statistics. These categories are explored further in the Statistics section of this handbook. 8 Sampling Methods A population is all of a group of interest (in New Mexico, the whole enchilada). A sample is a subset of the population being studied (in New Mexico, some of the enchilada). Sampling can make it feasible to implement research methods that would otherwise be impossible or too time consuming to consider. Sampling methods fall into two broad categories: those that offer the potential to support inferences about the entire population (probability methods) and those that do not (non-probability methods). Probability methods are typically used with quantitative research designs and inferential statistics; whereas, non-probability methods are more often (but not always) used with qualitative designs and descriptive statistics. If you want to be relatively sure that the results you obtain from your sampled students are generalizable to the population of students in your course and/or program, consider using a probability method, such as simple, systematic, stratified, or clustered random sampling. Below are some examples: Simple random sample: Drawing names from a hat (For the sample to be truly random, all of the possible names must be in the hat to begin with. If you systematically leave out any group of people [e.g., students who are failing the course], then your results will not be generalizable to that group.) Systematic random sample: Counting off every 5th student who comes through the door Stratified sample: Randomly selecting 10 first-year students, 10 second-year students, and 10 recent graduates Cluster sample: Randomly selecting 6 students from each of five sections of the same course To be confident in your inferences, you will want to select a sample that is big enough to be representative of the population you are studying. Consider using an online sample-size calculator, such as the one provided by Creative Research Systems at http://www.surveysystem.com/sscalc.htm, to find out how many students you need to include in your sample to achieve the level of statistical reliability you want. Then, you can use an online random integer generator, such as the one at http://www.random.org/integers/, to get numbers that you can apply to a numbered list of all students in the research ‘population’ to create your random sample. On the other hand, if you don’t care about generalizing your sample’s results to a larger population, you might prefer to use non-probability methods such as convenience, purposive, snowball, quota, or theoretical sampling. Just keep in mind that if you choose any of these, the statistics you obtain will not support claims that your findings can be generalized to the entire population. Following are some examples: Convenience sample: Using your own class as a sample Purposive sample: Asking all of the African American students in your program to participate in a focus group to provide feedback regarding their perceptions of cultural inclusiveness within the program Snowball sample: Interviewing a few people and asking them to refer their friends to you for interviews Quota sampling: Selecting 200 students in such a way that your sample has essentially the same proportionate representation of ethnic groups seen in your program Theoretical sample: Questioning a few students about something you’re considering studying further 9 STATISTICS Depending on the type of assessment/research you conduct, you may or may not need to include statistical analyses. This final section is included for those who do plan to use statistics but are not expert statisticians. If you are planning to conduct a formal study, even an in-house one, and are not a statistician, consider consulting with someone who is a statistician during the planning phase, before you begin collecting any data. This will help ensure that the data collected will be able to provide the information you seek and shed useful light on the research topic. Also, consider including a statistician in your plans for conducting the statistical analysis once you have collected the data. Data Types Although as a branch of mathematics statistics involves numbers, many types of information can used for purposes of statistical analysis. Before selecting a method of analysis, it is important to understand what type of data you have. And, considering the method of analysis prior to conducting a study can help to ensure that the data collected will provide the desired functionality. There are four main types of data: nominal, ordinal, interval, and ratio: Nominal: Numbers or words are used to label, classify, or categorize. For example, each student may be represented by a student ID number. Or, maybe students are grouped according to a characteristic, such as gender, with 1 representing male and 2 representing female. The numbers or words are representations, not measures, and they have no inherent order (1 could just as easily represent female and 2 male), so they cannot be used in mathematical calculations. However, nominal data can be used to indicate similarities and differences. Ordinal: Numbers or words are used to categorize information on the basis of order. Observations are rank-ordered on the basis of some criterion, but the intervals between the rankings are not necessarily equal. For example, a postgraduate degree is higher than a bachelor’s degree, which is higher than an associate’s degree, which is higher than a high school diploma. Likert-item responses, letter grades, and rubric scores are other examples of ordinal data. Ordinal data is used to indicate more than or less than relationships. Interval: Numbers represent equal intervals between points on a scale, but the scale lacks an absolute zero. IQ scores, calendar years (e.g. 1983 and 2012), and temperatures are examples. Ratio: Numbers represent equal intervals between points on a scale, and the scale has an absolute zero, which allows measures of ratio to be taken. We can say that one item is two-thirds the size of another or that one student took twice as long as another to complete a task. Height, weight, time intervals, and counts are additional examples. Analytic Approaches Statistical analyses are often classified as parametric or nonparametric: Parametric analyses, associated primarily with inferential statistics, can be used with interval or ratio data that are normally distributed when the variance within the two groups being compared is the same or similar. Parametric tests include t-tests and analysis of variance (ANOVA). Longitudinal and cross-sectional studies typically employ parametric analyses. 10 Nonparametric analyses, associated more with descriptive statistics, can be used with nominal and ordinal data. They are also useful for interval and ratio data that are not normally distributed and/or do not have variances similar to those of the comparison populations. Chi square tests are non-parametric. The decision chart below can be used to identify the statistical tests most applicable to your research project. Figure 1: Statistical Test Decision-Making Flowchart Adapted from Ruth, D., Practical Statistics for Educators, p. 227 You may have noticed that the flowchart above does not include any guidance for the use of ordinal data. This is because statistical applications using ordinal data must take into account the degree to which the intervals between can be assumed to be equal. In some cases, such as the example of educational attainment given in the previous section, ordinal data is essentially nominal data with some order to it. While one might assign numbers to progressive educational awards (e.g., 1 for High School, 2 for Associate’s, 3 for Bachelor’s, and 4 for Postgrad), it would make no sense to calculate a mean on the data (or to conduct mean comparisons using an independent samples t-test). However, it might make sense to calculate the mode or to calculate median income associated with each educational level. 11 In other cases, when the ranking strongly suggests a degree of continuity across intervals, ordinal data may be treated more like interval data. For example, it is common practice, though not all agree it should be, to calculate mean responses from Likert items. One sees this more in the social sciences and in qualitative research than in the fields of math and science or in quantitative research. Correlation studies are commonly used in educational research because they are very helpful in determining whether a relationship exists and is worth exploring further. Fortunately, or unfortunately depending on how much you love statistics, a variety of correlation coefficients are available for computation. The Pearson correlation coefficient (a.k.a. the Pearson product-moment correlation, or just Pearson for short) is most widely used, and indeed some analytic software programs offer only the Pearson. Generally speaking, however, it is the type of data that determines which correlation study is most appropriate. Because the differences could be important if you intend to publish your research, the following differentiations are provided: Pearson is a nonparametric test and can be used when the data is not normally distributed and/or when you want to see if there is a relationship between two interval or ratio variables. As indicated above, however, Pearson is commonly used with normally distributed data as well. Spearman’s rho (Spearman for short) is a parametric test, intended for use with normally distributed data. You can use it to compute the correlation between two ordinal variables. However, the Spearman is most appropriately used with ordinal data for which the ranking is highly detailed and/or strongly suggests a degree of continuity across intervals. In other words, Spearman is good for analyzing ordinal data that closely resembles interval data. For ordinal data that is more like nominal data, having five or fewer rankings with some order to them and/or having intervals that are not really approximately equal (such as Likert-item responses), a Gamma correlation may be more appropriate. Point-biserial correlation studies can be used to analyze dichotomous data. Note, however, that only one of the variables can be dichotomous. Some data is naturally dichotomous, such as gender or Yes/No responses. And, sometimes people artificially dichotomize data, particularly ordinal data (beware however that some people frown on this practice). A common example of artificial dichotomization is taking Likert-items and putting the “Agree” and “Strongly Agree” responses into one category while putting the “Disagree” and “Strongly Disagree” responses in another. Another example would be putting all of the students who passed with grades of C or better into one category and all the students who got D’s and F’s into another. Refresher of Basic Concepts The following list highlights some key statistical concepts. Additional definitions are provided in the glossary. Far more extensive glossaries of statistical terminology can be accessed online at sites such as statistics.com and stat.berkeley.edu. Alternative Hypothesis: Typically a statement made prior to a study asserting (hypothesizing) that there is or will be an observable effect or difference. (Contrast with Null Hypothesis.) Averages: 3 main measures of central tendency (though there are others) Mean: The arithmetic calculation of the sum of the numbers divided by the count of the numbers. This is what most people mean (no pun intended) when they say “average.” It is the balance point of the data. (A deviation is the difference between a data value and the mean. With the mean as the balance point, the sum of the deviations above the mean will cancel the sum of the deviations below the mean.) 12 Median: The middle value between the highest and the lowest of the numbers when the set is arranged in order. For example, if the set is {2, 3, 3, 5, 6, 8, 9}, the median is 5. If you have an even count of numbers, the median is the sum of the two in the middle divided by 2. For example, if the set is {1, 1, 3, 5, 5, 6}, the median is calculated (3+5) / 2 = 4. Mode: The most frequently occurring of the numbers. In the set {2, 3, 3, 5, 6, 8, 9}, the mode is 3 because that is the only number that occurs twice. A set can have more than one mode, as in {1, 1, 3, 5, 5, 6}, and if no number occurs more than once, there is no mode at all. The mode is considered a measure of central tendency because in a moreor-less normally distributed population, the mode would tend to be at or near the middle, as would the mean and median. Correlation: A procedure used to quantify the relationship between two quantitative variables. In a positive correlation, an increase in one variable corresponds to an increase in the other. In a negative correlation, an increase in one corresponds to a decrease in the other. Correlations are represented using correlation coefficients, ranging between -1 and 1, with zero indicating no correlation at all and the strength of the correlation increasing toward each end of the range. A correlation of 1 would suggest that an increase in one variable corresponds to an equivalent increase in the other, a phenomenon that rarely seen in educational research. In interpreting correlations, it is important to keep in mind that they do not necessarily demonstrate causal relationships. Both variables could potentially be influenced by other variables. For example, ACT scores correlate positively with college completion rates, but having higher ACT scores does not make students more likely to finish college. Both are influenced by prior educational experiences, socio-economic status, parental education, cultural/family expectations, etc. Data Plots: Graphic representations of data sets, often taking the form of scatterplots, histograms (similar to bar graphs), or box plots (a.k.a. box and whisker charts). Effect Size: A calculation that quantifies the magnitude of a difference between two groups. The effect size is a measure of how far apart the true value of the parameter being tested and the one assumed in the null hypothesis are. Failing to detect the difference is a type II error (interpreting the data as not showing a difference when there really is one). Sometimes effect size is presented instead of (or in addition to) calculations of statistical significance, in which very large sample sizes tend to cause relatively minor differences to be flagged as significant. Effect size is an indicator of whether the statistically significant information is of practical significance. Different formulas can be used to calculate effect size, but one way is to divide the difference between the two group means by the standard deviation (of either group or of the two groups averaged). Basically, this tells you how many standard deviations’ difference there is between the two means. For example, an effect size of .70 tells you that the difference between the means represents approximately 7/10 of a standard deviation. Some would consider this a large effect size. However, different scales are used for interpretation, depending on the field of study. Some interpret .25 as small, .50 as medium, and .80 as large. Others consider .10 small, .30 medium, and .50 large. 13 Hypothesis Testing: A formal process of deciding whether to reject or not to reject a null hypothesis based on the results of a study. Level of Significance & Confidence Level: The level of significance is referred to as the alpha level. It is used in hypothesis testing and should be chosen before the study is conducted. The level of significance quantifies the investigator’s willingness to make a type I error (to interpret the data as showing a difference when one truly does not exist) by identifying what p-value will be considered the ‘bar’ (or cutoff) for statistical significance. After the data is collected, a p-value is calculated from the sample, and if it is smaller than the level of significance, then the null hypothesis is rejected (and the relationship deemed to be statistically significant). Confidence Level is the likelihood that the corresponding confidence interval is one of the intervals that contains the parameter being estimated. The confidence level is one minus the level of significance. A confidence interval can be interpreted as a two-sided or two-tailed hypothesis test at the alpha level of significance. Normal Curve: A frequency plot showing the distribution of a characteristic over a population. Also known as a bell curve, the graphic representation is bell-shaped, with the majority of people clustering near the mean. A perfectly normal distribution is one in which mean, median, and mode all have the same value and the tails (lowest and highest ends) remain slightly above the horizontal axis, as shown in Figure 2. Figure 2: Normal Distribution Image retrieved from http://www.lesn.appstate.edu/olson/normal_curve.htm Null Hypothesis: Specifies a value for a parameter. Usually, the choice is based on there being no relationship or effect, no difference between groups, or no difference from a previous value. Like the 14 assumption of innocence until proven guilty, the null hypothesis places the burden upon the investigator to show with significant evidence, beyond reasonable doubt, that a difference exists. Outliers: Data points that are significantly above or below the trend or deviate from the pattern. The following scatterplot illustrates by pointing out two outliers: Figure 3: Illustration of Outliers Power: An estimate of the sensitivity of the test in detecting a difference. The power of a test is the chance that the test correctly rejects a null hypothesis, i.e., the chance of not making a type II error. p-Value: The “p” in p-value stands for probability. A p-value represents the probability that an experimental result is due to a random occurrence under a true null hypothesis. (It is essentially the opposite of the probability of the phenomenon occurring in any given case, which is known as the confidence interval.) P-values are presented with tests of statistical significance to communicate the level of reliability of the statistical conclusions. The smaller the p-value, the greater the reliability of the statistical significance. P-values are usually derived from the “Sig” (significance) data in analytical output tables and are written as “p ≤ 0.01,” which equates to a 99% confidence interval (i.e., a 99% or greater probability of the relationship holding true), or as “p ≤ 0.05,” which corresponds to a 95% confidence interval. Data with a p-value > 0.05 is usually interpreted as not rising to the level of statistical significance. Standard Deviation (SD): A measure of how spread out (dispersed) the numbers are from the mean, symbolized as σ (lower case sigma) for the population standard deviation and “s” for the sample standard deviation. The more closely grouped the numbers are around the mean, the smaller the SD will be. SD is expressed in the same units used to express the mean, so it makes it relatively easy for people to understand just how big the spread is above and below the mean. SD is the square root of the mean of the squared differences from the mean value, i.e., the square root of the variance. Other measures of the spread: Standard Error is used when the standard deviation for an experiment is estimated from its data. Although the sample SD will appear in it, the correct form will depend on the experiment. Margin of Error is the product of a probability factor, to insert the confidence, and the standard error as a scale factor based on the data. It is the number behind the plus or minus in a confidence interval and represents half the width of the interval. For example, with 95% confidence, the true mean μ is thought to lie on the interval 5.25 ± .37. 15 Type I and Type II Errors: Variance: A type I error is a false positive where a true null hypothesis is rejected. A type II error is a false negative where a false null hypothesis is not rejected. Variance is way of measuring how spread out (or dispersed) the numbers are from the mean. The obvious measure of the spread would be the mean of the deviations (the differences between the data values and the mean); however, the mean is the balance point of the data, so the sum of the deviations above the mean cancel those below, and the mean deviation therefore is always zero. To avoid this problem, the deviations are squared. Variance, then, is the mean of the squared deviations. This result is the variance for the population. There is an inherent error caused by sampling. This is corrected for by dividing by one less than the number of data values to obtain the variance for the experimental sample. Analytic Software Programs Open-source and public-domain platforms and online calculators are readily available to support most analytic needs. Here are just a few examples: Dataplot: National Institute of Standards and Technology PSPP: The Free Technology Academy Statistical Calculators: MathPortal.org Sample Size Calculator: National Statistical Service Sample Size Calculator: Creative Research Systems Preliminary Data Analysis Tips Clean up your data: Before beginning an in-depth analysis, look through the data for typos and errors in formatting. If you are using Excel, be sure to format the cells consistently throughout each field. Look for number fields formatted as text, and check decimal settings, time and date formatting, etc. If you are using PSPP, SPSS, or some other statistics software, be sure the “Variables” view has the correct data type selected. Scan the data for things like ID numbers with too many or too few digits or income figures with inconsistent formatting (ranges, dollar signs, text, etc.). Consider sorting the records in ascending or descending order by each data field to make errors more apparent. Develop some first impressions: Look for general trends and/or groupings and anything else that stands out. Often it is helpful to look at the data of interest in a scatter plot before running tests on it to see if any linear or grouping trends are apparent, to check for outliers, etc. Having familiarized yourself with the data in this way will stimulate ideas about what to explore and will help you interpret the output when you run your analyses. Check to see if the requirements for the model you chose are met: For example, for a nearly normal model, look for a single peak and symmetry in the data using a histogram or a normal probability plot. For a linear regression, check for linearity using a scatterplot. After fitting the regression line to the data, look for structure in the residuals that indicate problems in the original linear model. 16 GLOSSARY ANOVA Case Studies Chi Square (χ2) Content Validity Construct Construct Validity Correlation Cross-Sectional Analyses Descriptive Statistics Empirical Evidence Hawthorne Effect Inferential Statistics Internal Validity Inter-Rater Reliability Interval Data Longitudinal Analysis Mean Median Analysis of Variance, a parametric test used to compare the means of two or more independent samples and check for statistically significant differences. Anecdotal reports prepared by trained observers. When numerous case studies are analyzed together, the findings are often considered empirical evidence. A non-parametric test used to compare the frequencies of counts, using nominal or ordinal data, to expected frequencies. The Goodness of Fit Chi Square is used for one independent variable, the Chi Square Test of Independence for two independent variables. The degree to which an instrument actually measures the content it is intended to measure. A theoretical entity or concept such as aptitude, creativity, engagement, intelligence, interest, or motivation. Constructs cannot be directly measured, so we assess indicators of their existence. For example, IQ tests assess prior learning as an indicator of intelligence. The ability of a research approach to actually represent the construct it purports to address. Since constructs cannot be directly measured, construct validity depends upon how much people can agree upon a definition of the construct and how truly the indicators represent that definition. A process of quantifying a relationship between two quantitative variables in which an increase in one is associated with an increase in the other (a positive correlation) or a decrease in the other (a negative correlation). Studies comparing, at a single point in time, people representing different stages of development, for the purpose of drawing inferences about how people change/develop over time. Tests used to quantitatively describe phenomena. (Compare nonparametric analyses. Contrast inferential statistics.) Usually refers to what has been systematically observed. May or may not be obtained through scientific experimentation. A phenomenon in which the behavior of participants in a study is affected by their knowing they are being studied. Tests used to quantify the degree to which generalizations can be made to a population based on outcomes observed in a sample. (Compare parametric analyses. Contrast descriptive statistics.) The extent to which variables in a study can be controlled sufficiently to ensure that any observed change is actually attributable to the intervention. The degree of consistency in ratings among different evaluators. For example, if five judges score a performance with nearly identical ratings, the inter-rater reliability is said to be high. Numbers on a scale comprised of equal intervals but having no absolute zero. A comparison of outcomes over time within a given group, based on changes at an individual level (versus differences in group averages). E.g., comparing pretest and post-test scores to assess learning gains. The average, arithmetic mean, derived by dividing the sum of all scores by the total number (count) of scores. It is the balance point of the data. The midpoint, were 50% of scores are above it and 50% are below it (in position, not value) when all of the scores are ranked in ascending order. 17 Mode Nominal Data Nonparametric Analyses Null Hypothesis Ordinal Data Outliers Parametric Analyses Qualitative Evidence Quantitative Evidence Randomized, Controlled Trials (RCTs) Ratio Data Reliability Standard Deviation (SD) t-Test Validity The most frequently occurring score. Numbers (usually) represent individuals or categories with no associated order. E.g., numbers assigned to ethnic groups or student ID numbers. Studies using nominal and/or ordinal data or any data type that does not follow a normal distribution and/or does not have variances similar to those of the comparison population. Nonparametric tests do not support population inferences. They are used primarily in qualitative research. A statement to be tested through systematic investigation, asserting that no relationship or effect exists between the variables being studied, that there is no difference between groups, or that there is no difference from a previous value. In scientific research, the assumption is that no claims can be made regarding relationships between variables unless significant evidence exists to reject the null hypothesis. Even so, a cause-and-effect relationship cannot be established. Numbers (usually) represent observations that have an underlying order or magnitude but the intervals cannot be assumed to be equal. Data points that are noticeably outside the pattern of the point distribution. Studies using interval or ratio data that are more-or-less normally distributed when the variance within the two groups being compared is the same or similar. Parametric analyses can be used to support generalizations/inferences. (Compare inferential statistics.) Evidence that cannot be directly measured, such as people’s perceptions, valuations, opinions, effort levels, behaviors, etc. Many people mistakenly refer to qualitative evidence that has been assigned numerical representations (e.g., Likert-scale ratings) as quantitative data. The difference is that the qualitative evidence is inherently non-numerical. Evidence that occurs in traditionally established, measurable units, such as counts of students, grade point averages, percentages of questions correctly answered, speed of task completion, etc. A model for scientific experimentation. Participants are randomly selected and divided into groups, experimental and control, for outcome comparison. Numbers on a scale comprised of equal intervals and having an absolute zero. Consistency in outcomes over time or under differing circumstances. For example, if a student challenges a placement exam repeatedly without doing any study in between and keeps getting essentially the same result, the test has a high level of reliability. A measure of the spread of scores about the mean. SD is the square root of the variance. There are two, one for a population and one for a sample; these are calculated slightly differently. Originally derived for small samples, current practice is to use this distribution when the population standard deviation is not known. The ability of a research approach to measure what it is intended to measure. An aptitude test in which cultural expectations or socio-economic level significantly influence outcomes may have low validity. 18 REFERENCES Adams, G., & Engelmann, S. (1996). Research on direct instruction. Seattle, WA: Educational Achiebement Systems. Grossen, B. (1996). The story behind Project Follow Through. Effective school practices, Vol. 15, 1. Retrieved from http://darkwing.uoregon.edu/~adiep/ft/grossen.htm. Institutional Review Board (2004). CNM institutional review board [PDF document]. Central New Mexico Community College. Retrieved from http://www.cnm.edu/depts/planning/instres/irb/documents/CNM_Institutional_Review_Board.pdf National Institute of Health Office of Extramural Research. Protecting human research participants. [IRB training PDF]. Retrieved from http://phrp.nihtraining.com/ U.S. Department of Health and Human Services (1993). Institutional review board guidebook. Retrieved from http://www.hhs.gov/ohrp/archive/irb/irb_chapter1.htm. Schneider, W. H. The establishment of institutional review boards in the U.S. background history. Indiana University-Purdue University Indianapolis History Department. Retrieved from http://www.iupui.edu/~histwhs/G504.dir/irbhist.html. 19 APPENDIX A: RECOMMENDATIONS FROM CNM’S IRB https://www.cnm.edu/depts/planning/instres/irb/documents/RECOMMENDATIONS-FOR-RESEARCH.pdf 20 APPENDIX B: PREVIEW OF CNM IRB APPLICATION FORM 21 22 23 24 25 26 http://www.cnm.edu/depts/planning/instres/irb 27