Reliability and Validity
Q560: Experimental Methods in Cognitive Science, Lecture 16

Psychological Measurement
• Psychometric theory; sources of measurement noise.
• Cognitive construct: the label given to a hypothetical characteristic (e.g., attention, STM capacity, executive function, intelligence, forgetting), a dimension along which subjects can be located based on behavior. We cannot measure a construct directly, so we link it to behavior believed to reflect the construct.
• Operational definition: a statement specifying how a construct is measured; the link between the overt and latent variables.

Reliability and Validity
• Reliability and validity are important concepts in both measurement and the full experiment, to ensure that the link between our statistics and our conclusions is sound.
• Reliability = "consistency"; Validity = "on-targetness"
• Reliability is a necessary but insufficient condition for validity.

Reliability
• Reliability is the extent to which the measurements of a test remain consistent over repeated tests of the same subject under identical conditions.
• An experiment is reliable if it yields consistent results on the same measure; it is unreliable if repeated measurements give different results (cf. a reliable car: repeatability).
• Reliability of an experiment: replication and error rate.
• Reliability does not imply validity.

Validity
• The validity of a measure is the degree to which the variable measures what it is intended to measure (e.g., IQ tests, GREs, eye tests).
• A valid measure is reliable; a reliable measure is not necessarily valid.
• Shooting analogy:
  Unreliable and invalid shooting: Sam "Scattershot" Wilson
  Reliable but invalid shooting: Ralph "Rightpull" Roberts
  Reliable and valid shooting: Kit "Bullseye" Carson

Experimental Validity
• A valid design is necessary for valid scientific conclusions.
• Crib sheet: www.indiana.edu/~clcl/Q560/Validity.pdf
1. Statistical conclusion validity: the validity with which statements about the association of two variables can be made on the basis of statistical tests. Threats: measurement reliability (Rxx), violated statistical assumptions.
2. Construct validity: the validity with which we can make generalizations about higher-order constructs from the experimental results. Threats: vague operational definitions (Watson); experimenter/participant expectancy effects.
3. Internal validity: the validity with which statements can be made about the causal relationship between the variables as manipulated. Threats: confounds (history, maturation, testing), instrumentation, statistical regression, mortality, etc.
4. External validity: the validity with which we can make generalizations from the sample/experiment (ecological validity). Threats: interactions of the setting/selection method with the treatment.

Dr. N. Lewis is interested in whether memory encoding is stronger for pictures of objects or for words that refer to the same objects. He has participants learn a list of 30 written words that refer to objects (followed by a distractor task) and then recall as many words as they can. Next, he gives the same participants a list of 30 pictures of the same objects and (after the same distractor task) has them again recall as many words as they can. At the end of the experiment, participants recalled a mean of 16 words in the written condition vs. a mean of 24 words in the picture condition.
1. What type of study is this?
2. What is the independent variable?
3. What is the dependent variable?
4. Is the dependent variable discrete or continuous?
5. What is the scale of measurement for the dependent variable?
6. Name one confounding variable.

A researcher is interested in whether or not snakes can detect insults. He buys 20 exotic snakes and separates them into two groups of 10. For one group of snakes, he insults them for 10 minutes each. For the other group, he simply stares at them for 10 minutes each.
For each group, he records the number of times the snakes bite him (assuming that a bite indicates that the snake took offense). At the end of the experiment, the group of snakes he insulted had bitten him 23 times vs. only 8 bites from the group he did not insult.
1. What type of study is this?
2. What is the independent variable?
3. What is the dependent variable?
4. Is the dependent variable discrete or continuous?
5. What is the scale of measurement for the dependent variable?
6. Name one confounding variable.

Article Critique
• Assignment #5: Article critique

Small-N Designs
Why would we want to do an experiment with a small number of subjects?
• Because it is easier and faster than doing an experiment with a large number of subjects? This is untrue: small-N experiments are frequently more difficult and time-consuming, even though there are only 1-5 participants.
• In a small-N design, we make repeated measurements on a small number of participants.
• Although there are fewer participants, there are many more observations per participant.
• These experiments can involve many hours of testing across several thousand trials.
• The goal is to provide a complete and accurate description of a single subject's behavioral changes as a function of a repeated measure.
• Other subjects are replications (separate experiments).
• The experimenter is often a subject.

Why would we want to do this?
1. Practical reasons: A small-N design may be necessary because it is difficult to get subjects from a rare population (e.g., OCD, Alzheimer's), or because the treatments are expensive or time-consuming (e.g., teaching sign language to a chimp: Patterson & Linden, 1981, spent 10 years doing this).
2. Theoretical reasons: Skinner believed that "the best way to understand behavior is to study single individuals intensely"; one should "study a single subject for a thousand hours rather than a thousand subjects for an hour" (1966, p. 21). "If conditions are precisely controlled, then orderly and predictable behavior will follow."
3. Methodological reasons: Pooling or averaging data from many subjects can produce misleading results as an artifact of grouping (Sidman, 1960).
• Averaging can produce results that do not characterize any subject who participated. More importantly, it can produce a result supporting theory X when perhaps it shouldn't (Estes).
• E.g., Manis (1971) tested children in a discrimination-learning task: the stimuli were simple objects, and the child had to learn which feature was the diagnostic one (shape, color, position, etc.) across a series of feedback trials.
• Continuity theory: concept learning is a gradual process of accumulating "habit strength."
• Noncontinuity theory: subjects actively try out different "hypotheses" over trials. While they search for the correct hypothesis, performance is at chance; once they hit the correct hypothesis, performance shoots up to 100% and stays there.
[Slide figures: example trial sequence with feedback; performance curves from chance to perfect across trials. The averaged data rise gradually ("Continuity is right!"), but the individual data before averaging show sudden jumps ("Noncontinuity is right?!")]

CogSci Began with Small-N Research
• Ebbinghaus, Wundt, Dressler, Thorndike.
• It wasn't until the 1930s that experiments with large numbers of participants and aggregate statistics became commonplace (largely due to Fisher).
• Psychophysics is still dominated by small-N designs.
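The grouping artifact described above can be demonstrated in a few lines of simulation. This is a minimal sketch, not a reconstruction of Manis (1971): it assumes an idealized all-or-none learner (noncontinuity theory) whose expected accuracy jumps from chance (0.5) to perfect (1.0) on a random "insight" trial; all function names and numbers are illustrative.

```python
import random

def simulate_subject(n_trials=40, switch_range=(5, 35)):
    """All-or-none learner: expected accuracy is at chance (0.5)
    before a randomly placed insight trial, and perfect (1.0) after."""
    switch = random.randint(*switch_range)
    return [0.5 if t < switch else 1.0 for t in range(n_trials)]

def averaged_curve(n_subjects=200, n_trials=40):
    """Average the step-shaped individual curves across subjects."""
    subjects = [simulate_subject(n_trials) for _ in range(n_subjects)]
    return [sum(s[t] for s in subjects) / n_subjects
            for t in range(n_trials)]

random.seed(1)
curve = averaged_curve()

# Every individual curve jumps from 0.5 to 1.0 in a single trial,
# yet the group average climbs smoothly: the aggregate looks like
# gradual (continuity-style) learning that no single subject showed.
print([round(p, 2) for p in curve[::5]])
```

Because the insight trial varies across subjects, the fraction of "switched" subjects grows trial by trial, so the average rises smoothly even though each individual curve is a step function.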
The idea is that there is very high similarity between our perceptual systems; with sufficient control, a stable effect should be observable without needing many subjects.
• "An effect that isn't stable enough to be studied with a small N isn't worth studying."

Elements of Small-N Designs
1. A within-subjects manipulation.
2. The target behavior must be operationally defined.
3. Establish a baseline of responding/behavior.
4. Begin the treatment manipulation and monitor the change from baseline.

How do we analyze this?
• Visual inspection
• Curve/trend fitting based on theory
• Change from baseline

Withdrawal Designs
• Get a measurement of baseline behavior on the DV.
• Introduce the manipulation; however, a change in responding may be due to history or maturation (AB design).
• Return to baseline: if the change is due to history or maturation, it is unlikely that the behavior will regress when the treatment is removed (an ABA design). The ABAB design is even more popular.
• Variants: (1) multiple-baselines design, (2) alternating-treatments design, (3) changing-criterion design, (4) staircase designs.
• E.g., does talking to a plant make it grow? (A) Baseline: first three months; (B) Treatment: second three months; (A) Baseline: final three months.
[Slide figure: growth of ficus divinicus in inches across the three periods]

Small-N Designs and Psychophysics
• Psychophysics: the relationship between a physical stimulus and the perceptual reaction to it.
• Small-N designs are popular in psychophysics because:
  - Few subjects are needed, because of the similarity between our sensory systems (results generalize).
  - On each trial, the data are much less affected by error variance than in questionnaire research (also because of laboratory control).
  - Trials are very quick and easy: why run a 30-minute experiment when the data collection itself takes only 30 seconds?
  - Study one subject; the other subjects are replications.

Criticisms of Small-N Designs
1. External validity: to what extent do these results generalize?
2. They are criticized for relying on visual inspection of the data instead of statistical analysis (though there are more theory-driven approaches, useful for model fitting).
3. Small-N designs cannot adequately test for interaction effects (interactive designs exist, but they are very cumbersome: ABBCBABBCB designs, etc.).
4. Due to their operant-learning (Skinnerian) tradition, they tend to focus on response frequency as a DV, rather than RT, accuracy, habituation, etc.
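The staircase designs mentioned above can be sketched as a simple adaptive procedure. Below is a minimal, hypothetical 2-down/1-up staircase for estimating a detection threshold; the observer is simulated with an assumed linear psychometric function, and all names (`run_staircase`, `true_threshold`) and numbers are illustrative rather than taken from any particular study.

```python
import random

def run_staircase(true_threshold=0.5, start=1.0, step=0.05, n_reversals=12):
    """2-down/1-up staircase: the stimulus intensity decreases after two
    consecutive correct responses and increases after any error, so it
    converges near the ~71%-correct point of the observer's psychometric
    function. Returns the mean intensity at the final reversals."""
    intensity, streak, direction = start, 0, 0
    reversals = []
    while len(reversals) < n_reversals:
        # Simulated observer: detection probability rises with intensity
        # (an assumed linear psychometric function, clipped to [0.01, 0.99]).
        p_correct = min(0.99, max(0.01, 0.5 + (intensity - true_threshold)))
        if random.random() < p_correct:
            streak += 1
            if streak == 2:                      # two in a row -> harder
                streak = 0
                if direction == +1:              # direction flip = reversal
                    reversals.append(intensity)
                direction = -1
                intensity = max(0.0, intensity - step)
        else:                                    # any error -> easier
            streak = 0
            if direction == -1:
                reversals.append(intensity)
            direction = +1
            intensity += step
    # Estimate the threshold from the last few reversal intensities.
    tail = reversals[-8:]
    return sum(tail) / len(tail)

random.seed(0)
estimate = run_staircase()
```

Note how the design builds the change-from-baseline logic into the procedure itself: each reversal is a data point, and the analysis is a simple summary of where the staircase settles rather than a group statistic.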