Psych. 461 Psychological Testing
2 – 1 – 99
Monday 7 – 9:45pm
We will go over the normal curve a lot. We need to know all the statistics that go
with the normal bell curve, including the standard deviations and the measures of
central tendency: mean, median, and mode. Memorize the diagram on page 78.
Psychometrics is the study of individual differences. Properties include test
norms, reliability and standardization.
Testing for jobs began 2000 years ago in China to see who could serve in a
civil service job. Francis Galton used correlational studies on over 9,000
subjects. This was called “The Brass Instruments Era” because he was trying to
quantify psychological testing using instruments. This was the late 1800s, when
psychology was trying to establish itself as a respectable science. Galton said
intelligence is inherited; it’s genetic. Similar to the book “The Bell Curve.”
The American James Cattell used Galton’s ideas and brought them over to
the U.S. He coined the term mental tests in 1890. If people could sort out
information faster would they be more intelligent? He said “no”, there is no
correlation between intelligence and speed of discrimination.
In France, Alfred Binet came up with an intelligence test for children. Binet
worked with John Stuart Mill and J.M. Charcot. The goal was to identify which
kids could benefit from the school system. Lewis Terman changed the test in
1916, and the test is now
called the Stanford – Binet test. The test was standardized and made for both
children and adults. This is one of 2 intelligence tests still widely used today.
These are individual tests.
In WW1 intelligence tests were given to soldiers to separate them into the
alpha or beta groups. They were given group tests. Thousands of recruits
needed to be tested. It would have been too time consuming to individually test
thousands of recruits.
 TAT test: Thematic Apperception Test. A projective test second only to the
Rorschach test. When shown ambiguous stimuli, subjects inadvertently
disclose their innermost needs and conflicts. Introduced in
1935.
 MMPI: Minnesota Multiphasic Personality Inventory (1942) A self report
inventory of personality. Revised after 50 years.
 Then in 1969 Arthur Jensen (p. 278) said there are racial differences in IQ and
that children were not a homogeneous group. The book “The Bell Curve” was
similar to this idea and said IQ is a predictor for a variety of social
pathologies.
1989 – “Lake Wobegon Effect”: every state claimed that its kids were
scoring above the average. But that’s impossible; not everyone can be above average.
The teachers were helping the students cheat so they could get good
evaluations on these performance-based tests.
TESTING
Testing is for research; this is how we collect information. We choose who
becomes the next doctor or teacher through tests. Testing is big business.
Employees are tested for integrity and personality. Through tests we find
out whether drug and alcohol programs are successful or not; should we continue
to fund them? Testing will determine that.
 Standardized test vs. non-standardized tests. Standardized tests have fixed
directions for administering and for scoring tests. They are constructed by
professionals in their fields. The test results provide what the NORM will be.
We then compare other scores to the norm.
 Individual test vs. group tests. Individual tests are one on one.
 Objective vs. Subjective tests. Objective tests can be scored by following the
directions, and the results should be the same for everyone who administers
them; they are non-judgmental, like scoring a scantron sheet through a
machine. Subjective tests are judged by others. Ex. essay questions and
the Rorschach test.
 Verbal vs. Nonverbal tests. Verbal tests can be oral tests or
vocabulary/reading ability. Nonverbal tests could be looking at diagrams.
 Cognitive vs. Affective. Cognitive tests measure mental processes and
mental products: achievement, intelligence, and creativity tests. Affective
tests look at attitudes, values, and personality.
Normal Bell Curve
 Total area under the curve represents the total # of scores in the distribution.
 The median is in the middle where ½ the scores fall under and ½ fall above.
 The curve encompasses roughly 4 standard deviations in each direction.
Mean, median, and mode all fall in the center. The percentage of scores falling
between any two points can be approximated.
1. 68% of all cases fall within ±1 standard deviation of the mean.
2. 95% of all cases fall within ±2 SD of the mean.
3. 99% of all cases fall within ±3 SD of the mean.
Percentile ranks can be determined for any score in the distribution.
We need to transform raw scores into other scores so we can
find a norm to compare them to. A score of 50 means nothing unless you find
out how other people have scored relative to a score of 50. The most
common kind of transformation in psychology is the percentile.
A percentile rank expresses the percentage of scores or percentage of
persons in a relevant reference group who scored below a specific raw
score.
This means that higher scores mean a higher percentile rank, but the rank is
not to be confused with the number of correct answers received. A percentile
only tells how one has scored relative to other people, not what the absolute
score was. Someone could get 50/100 correct and still be at the 90th
percentile because others did very poorly.
A Percentile Rank (PR) for any given score = (n / N) × 100, where n is the total #
of scores that are lower than the score in question and N is the total # of scores.
In this score distribution, what are the PRs for the scores of 84 and 72?
64, 91, 83, 72, 94, 72, 66, 84, 87, 76.
The answer is the 60th percentile rank.
There are 6 scores below a score of 84 and there are 10 scores total. 6/10 = .6 * 100
= 60.
For a score of 72 the answer is the 20th percentile rank. There are two scores below
72. 2/10=.2 *100 = 20.
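A minimal sketch of this computation in Python (the function name and data are just illustrative):

def percentile_rank(score, scores):
    # percentage of scores in the reference group that fall below the given score
    n_below = sum(1 for s in scores if s < score)
    return n_below / len(scores) * 100

scores = [64, 91, 83, 72, 94, 72, 66, 84, 87, 76]
print(percentile_rank(84, scores))  # 60.0 -> 60th percentile rank
print(percentile_rank(72, scores))  # 20.0 -> 20th percentile rank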
Percentiles do not represent equal distances on a bell curve graph. The
scores are grouped together near the center because this is where most people
score; there are fewer scores at the ends, so the scores at the ends are
spread out more. So percentile scores are not good at telling how far apart
two people scored. However, they remain popular.
Linear transformations do not distort the underlying measurement scale.
Z-scores are standard scores (residual scores). Standard scores express a
score’s distance from the mean in standard deviation units; the two directly
correspond to each other.
If we know the standard score then we also know its percentile rank. Z-scores
have a mean of 0 and an SD of 1.
If you transform scores into z-scores we get a Standard normal
distribution. A Z-score of +1 = the 84th percentile rank.
(2.14+13.59+34.13+34.13 = 84%)
Z = (X (raw score) – M (mean of the distribution)) / SD.
 So a raw score of 10 with a mean of 5 and an SD of 5 would equal a z-score of +1.
 But a raw score of 15 with a mean of 5 and an SD of 20 gives a z-score of +0.5. This
would mean he scored around the 70th percentile rank, just below the 84th.
 Know all the means for each type of test in the text. Page 78.
Standardized scores (Memorize for tests)
 T-scores are used for personality tests such as the MMPI. No negative numbers
are involved. Mean of 50 and an SD of 10.
 IQ scores. Mean of 100. SD of 15 or 16 depending on the test. If you had a z-score of 2 then the IQ score would be 130.
 CEEB (College entrance exam boards) has a mean of 500 and an SD of 100.
Generic formula for all tests:
X′ = (Z)(SD) + M
T = (Z)(10) + 50
IQ = (Z)(15) + 100
CEEB = (Z)(100) + 500
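A short sketch of these conversions in Python (assuming scipy is available for the percentile lookup; the numbers reuse the z = +1 example above):

from scipy.stats import norm

def z_score(x, mean, sd):
    return (x - mean) / sd

z = z_score(10, 5, 5)        # +1.0
print(z * 10 + 50)           # T-score: 60.0
print(z * 15 + 100)          # IQ: 115.0
print(z * 100 + 500)         # CEEB: 600.0
print(norm.cdf(z) * 100)     # percentile rank: ~84.1, matching the 84th percentile above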
We see more negatively skewed distributions on tests because they’re too easy:
more high scores than low ones. A harder test would have more scores
piled up on the lower end and fewer scores on the higher (right-hand) side.
Testing 461 /2 – 8 – 99/2nd lecture
We turned in the enrollment forms. Read text chapters 1 – 3. Homework assignments #1, 2 and 3 are
due 2/22/99 at next class meeting.
Comprehensive checklist
 Why do we transform raw scores into standard scores? To understand them.
Raw scores are not meaningful unless they are transformed into interpretable
standard scores.
 What is the advantage of percentile ranks? Everyone can understand
where they stand relative to others. Not everyone understands what a t-score
means.
 Why is the assumption of normality an important concept in testing and
measurement? Only if we assume a normal bell curve can we apply the rules
of the statistics of the bell curve to our findings.
I. Standardization and Norms
 Standardized and nonstandardized tests – Descriptive data is referred to
as normative data. We look up the raw scores in our texts to see what the
corresponding score is. Standardized tests are tests that provide us with
directions for uniform administration of the test and allow uniform
scoring of the test. We’re making sure everyone is taking the same test
in terms of the experience they have in taking it. Everyone gets the
same environmental stimuli. Reacting to a kid’s behavior will make a
difference to his testing ability. The teacher introduces a confound to the
test when she reacts positively or negatively to her students.
 Standardization procedures ensure uniformity in administration and scoring;
therefore reducing error.
 Some tests require a lot of training, especially individual tests as opposed
to group tests: know how to keep accurate time, what the scoring procedures
are, and how to set up the material. Ex. Timing is important for cognitive tests
where time counts toward your IQ score. The WRAT (Wide Range
Achievement Test) gives explicit instructions to the tester on how to give it
to kids. Important instructions are bolded and colored.
II. Standard Administration
Can the test be used for the disabled? Can the subject hear, or can he see
correctly? First look for any special instructions for such a subject; if
there are no instructions then that is someone who needs to be referred out
for specialized testing.
Most individual tests have a checklist of behaviors to look out for.
An example is the Wechsler test. The list is called Behavioral Observations;
it includes the subject’s attitude, physical appearance, attention,
visual/auditory functioning, language – expressive/receptive (what is the dominant
language spoken at home and outside the home), affect/mood, and unusual
behavior/verbalization. This information is filled out immediately after the test
is taken. It may explain how the subject did; this list doesn’t affect the kid’s
IQ in any way.
Attitudes toward the test are critical. Some kids do not want to be tested.
Juveniles from the probation department do not want to be there in the first
place, so testers must develop a rapport before giving them the test.
Tell kids whether it is an achievement test or IQ test and tell them some will
be easy and some will be hard.
Find out what language they speak at home. If English is not their
dominant language they may not understand the test, so it has nothing to do
with their IQ if they perform poorly.
For group tests the subjects all have to hear the same instructions given by
the tester. He also has to make note of any environmental noise going on.
III. NORMS
Norm Group – consists of a sample of examinees who are representative of the
population for whom the test is intended.
 To get a norm value you have to administer the test to a large sample of
people who come from the population for whom the test is intended. This
large group is called the standardization sample or the norm group. If you
want to test kids your norm group will have to be elementary school kids not
adults.
 One of the major goals in standardization is to determine the distribution of
raw scores in the norm group. We need to give the test to enough people from
the target population to figure out how these people are
expected to perform on this test. The norm group will become the basis of
comparison for all future test takers.
 Then we convert the raw scores into norms. They take many forms: age
norms, geographic norms, percentile ranks, and standard scores (IQ,
CEEB, stanine, T-scores).
IV. Norm Referenced Tests vs. Criterion Referenced Tests
 A norm-referenced test is one where the examinee’s score is compared to some
reference group: an IQ test, an achievement test, or a personality test. It is
impossible to flunk a norm-referenced test. You can be lower or higher than
the norm group on some trait/subject but you cannot fail the test.
 A criterion-referenced test is where we compare the score to a well-defined
criterion. Ex. a typing test where you have to type a certain # of words per
minute. You have to have some sort of mastery of a skill/knowledge up to a
certain pre-established level/criterion.
V. Number of People needed to form an adequate norm group
 Establishing norms is important. How many people do we need to get a
representative sample? 100,000 could be tested for a group norm. A larger
sample gives larger variability in the scores and a better estimate of what the
range of performance is likely to be in the population.
 To establish norms for individual tests we would need to test 2,000 –
4,000 people for standardization. We cannot practically test 100,000
individuals one on one.
VI. Selecting Norm Groups
Representative on important characteristics. Ex. ethnicity, gender, education,
socio-economic status.
Year normative data was collected. Different eras could’ve produced different
results. Kindergarten testing was very different 20 years ago than today;
tests today expect more language skills than before. So
achievement tests are re-calibrated every 5 – 10 years because students’
norms change over time. Kids learn to read earlier now than before.
Supplemental norms – norms for the special group of people. (Mentally
retarded.) Ex. A test for ambulatory and non-ambulatory mentally retarded
people. Shows how they compare to average adults and their own group.
Sampling technique – A simple random sample is one where everyone in the
target population has an equal chance of being selected for that sample. Ex.
drawing names from a hat or using a computer, but this is too time-consuming
and expensive to be practical. So we use a…
Cluster Sampling – Take the target population and divide it into
geographic clusters. Then randomly choose 10 school districts, then
draw 2 high schools, then draw 2 classes, and test them. The results
become the standardized norm. The resulting groups are heterogeneous
and representative of the ones who were not chosen.
MMPI – Tested everyone from Minnesota but only in Minnesota. The test
was done in the 1940’s. It took nearly 50 years to re-calibrate the test in
1989. The new scores were similar. Norms should be appropriately gathered
and used.
Stratified random sampling – We divide the target population into subgroups.
Each group is a stratum of gender, age, etc. These are the stratification
variables, based on census data. Then the names are drawn randomly from
each stratum in proportion to the real number/incidence in the population at
large. Ex. if the population for a certain trait is 15% black then we choose
15% blacks for our sample. Homogeneous within each group but
heterogeneous across the sample, to give a good representation.
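A toy sketch of proportional stratified sampling in Python (the strata, their sizes, and the sample size are made up for illustration):

import random

population = {
    "stratum_a": list(range(150)),   # 15% of a population of 1,000
    "stratum_b": list(range(850)),   # the remaining 85%
}
total = sum(len(members) for members in population.values())
sample_size = 100

sample = []
for name, members in population.items():
    k = round(sample_size * len(members) / total)  # proportional allocation
    sample.extend(random.sample(members, k))
print(len(sample))  # ~100, drawn in proportion to each stratum's incidence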
After break 8:15pm. Three assignments due next class meeting. Going over the data we turned in last week on
our age and GPA.
VII. Types of Norms
Age norms – Common to report scores ±1 SD. On developmental traits,
babies are tested in shorter age increments than older kids.
A values test stays stable over a number of years. The age span
depends on what it is we are testing and where we expect to find
differences across ages.
Grade norms – popular with achievement tests.
Percentile ranks – Common norms. Easy to understand. Show how one
compares to others.
Standardized and nonstandardized score norms (Z-scores, T-scores, CEEB
scores). Linear transformations do not distort the underlying measurement
scale; Z-scores are standard scores (residual scores). Standard scores
express a score’s distance from the mean in standard deviation units; the two
directly correspond to each other. Useful to the professional who
understands them, not to the layperson. We know what the score means for
the subject.
Geographical norms – Ex. whether a particular school in a certain area is good
for a GATE program. A group intelligence test is used for this; students have to
score at the 95th percentile rank or above, i.e., in the top 5 percent.
Expectancy tables – Tables that present data based on large samples of
people who have been assessed on some predictor test and also on some
criterion measure/behavior, such as the ACT predicting college
grades/performance. A normal distribution arises from the ACT table; scores
at the lower and higher ends are better predictors of how people do in college. This
applies to group data, not to individuals themselves.
Correlation Coefficient
A descriptive and inferential statistic used to describe the strength of the relationship between
two or more variables. The correlation coefficient has a range of values from –1 to +1. The numerical
index indicates the strength and the direction of the relationship between two variables. There can be a
positive or negative relationship or no relationship at all, which would be a value of ZERO. The closer
to +1 or –1, the stronger the relationship. A restriction of range on either variable being correlated
reduces the magnitude of the correlation. Correlations cannot detect curvilinear relationships.
Correlation does not mean causation.
The Pearson product-moment correlation coefficient – r. Both variables
must be measured at least at the interval level of measurement.
Slope – change in Y for each one-unit change in X. (Rise over Run). The
correlation between two variables can be estimated by estimating the slope of
the line of best fit. So we can think of the correlation of x and y as being equal to
the slope of y regressed on x.
The formula is: rxy = byx (SDx / SDy)
What happens if the slope is .5? As x goes up 1 then y goes up a ½.
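A small numeric illustration of that relationship (the slope and SDs are made-up values):

# r_xy = b_yx * (SD_x / SD_y)
b_yx, sd_x, sd_y = 0.5, 2.0, 4.0
r_xy = b_yx * (sd_x / sd_y)
print(r_xy)  # 0.25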
Homework on predicting GPA from our ages.
Plotting age and GPA on the graph. The criterion variable is GPA, which goes
on the y-axis.
Age is used to predict GPA.
We always use x to predict y.
 Use as much of the data plot as possible.
 When we are done we will look at the relationship between these two
variables. On the basis of this guess the correlation from the line of best fit.
Comprehension check
What does standardization ensure? The data will be useful because
everyone got the same test.
Name two reasons why tests are standardized. To be able to
generalize, to interpret easily, so everyone has the same experience taking the
test, to have a basis for comparison, and to reduce error.
What are NORMS? They are the scores we use as the basis for comparison
that have been taken from standardization.
Where do we find the norms for standardized tests? In the test manual.
 What is a criterion referenced test? The test results will be compared to
some predetermined criteria/mastery level.
 If an individual’s raw score corresponded to a z-score of 1.5, what would his
corresponding IQ score be and his percentile rank be? IQ = 122, PR = 91st
percentile.
Test reliability
Quantitative and qualitative information on how good a test is. The degree to
which a test is consistent. An IQ score we got today should be the same as
another one we got a few days later. Correlation between two sets of test
scores for a given group of individuals on the same test. If r = 1 then the scores
are identical; everyone scored the same over a period of time. If it is 0 then the
scores are all over the plot.
The range of values for a reliability coefficient is from zero to one (0 – 1.0). There
are no negative numbers; you cannot have a negative reliability.
Reported reliability is an estimate on how free the test is from error in
measurement.
2 types of errors in the field of testing.
 Systematic errors – errors or changes that affect everyone’s score in exactly the
same way. They inflate or deflate the scores by some amount. They do
not affect test reliability; they affect test validity. The consistency of the
measurement is fine but not the validity. Ex. A teacher gives an extra point to
all the scores by mistake. A type of measurement error that arises when,
unknown to the test developer, a test consistently measures something other
than the trait for which it was intended.
 Unsystematic errors – Affect different people in different ways. Random acts
of chance that affect the test scores. Two primary sources of this:
1. The particular sample of questions that were on the test from the total
population of possible questions. Ex. A student could have taken a test
where he either knew all the answers or none of the answers; there was not
enough variability of questions asked.
2. Time of administration – The environment around the testers and the time
of day the test was given. Some people are morning people and some are
night people. Lighting in the room can affect your scores. Some may or may
not be annoyed by a marching band when taking a test. These are things
that have different effects on different people and they introduce error into
the testing session.
This is why strict standardization procedures are required to reduce error. In order
for a test to be reliable it must be relatively free from unsystematic errors.
Classical Test Theory aka Classical Reliability Theory and True Score
Theory
Chapter 3
 Every single score we see for an individual on a test is made up of two
components. The first is the individual’s true score. That true
score is thought to represent true differences among people on the trait
being measured.
 The other component is the error score; the random unsystematic error.
Represented by the equation: X = T + e. X is any observed score, T is the
true score and e is error. So a person’s true score is equal to the average
score an individual would’ve obtained if they took the test an infinite number
of times. We can never know a person’s true score.
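A quick simulation sketch of X = T + e in Python (the true score and error SD are invented): averaging many hypothetical administrations drives the random error toward zero, which is why the true score is defined as this long-run average.

import random

true_score = 100
# observed score = true score + random, unsystematic error
observed = [true_score + random.gauss(0, 5) for _ in range(10000)]
print(sum(observed) / len(observed))  # approaches the true score of 100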
Testing 461 /2 – 22 – 99/3rd lecture
Symbols:
 Standard deviation of the population – σ (sigma)
 Standard deviation of the sample – s
 Variance – σ² or s². It is a measure of variability:
the mean of the squared deviations. It tells how much the scores are spread out – if
it is a high value then the scores are spread out a lot; if it is a small value
then everyone scored in the same area.
 Correlation – r
 Reliability – r11
Four Assumptions made in Classical Test Theory
Measurement errors are random. We always assume there is going to be
error.
Mean error of measurement = 0
True scores and errors are uncorrelated.
Errors on different tests are uncorrelated. We need the same features/same
test again.
Concept of Reliability – The portion of the observed score variance that is
attributable to actual differences among people. The RANGE is from 0 – 1. You
cannot have an r11 less than 0.
TYPES OF RELIABILITY ESTIMATES
1. Test – Retest
2. Alternate forms/Parallel forms
3. Internal Reliability
4. Inter rater r11
TEST – RETEST uses ONE FORM in TWO sessions
On the Test – retest method, we need to administer the same test to the
same group of examinees twice in a short period of time.
The TIME INTERVAL must be a few days to a few weeks.
Estimated r11 is the computed correlation between scores on the
administrations of the test.
There is an assumption that the person’s true score did not change in that
brief period of time. Differences in scores are assumed to be due to unsystematic
error associated with testing conditions or the attitude of the examinee. Estimates of
error are due to testing conditions.
There is a possible practice effect, or true differences due to learning
or maturation, if intervals are too long. So a source of error is CHANGES
OVER TIME.
PARALLEL/ALTERNATE FORMS
Uses 2 FORMS for One immediate Session
Or if delayed then, 2 FORMS in 2 Sessions
 A different procedure for getting the r11 coefficient. Uses two independently
constructed forms of the same test that are assumed to measure the same thing;
it assumes the test forms are equivalent.
 Both forms are administered to the same examinees, either at the same time or
on separate occasions.
 Estimated r11 is the computed correlation between the two sets of scores.
 If given at the same time, the error must be due to sample of questions
because that is the only thing different. Or if we use the delayed
administration, then the error is due to both the sample of questions and the
testing conditions.
 Biggest problem with parallel forms is constructing equivalent forms of the
test.
INTERNAL CONSISTENCY COEFFICIENTS p. 94
Split half method
ONE FORM AND ONE TEST SESSION
 The psychometrician seeks to determine whether the items tend to measure
the same factor.
 We obtain an estimate of split – half reliability by correlating the pairs of
scores obtained from equivalent halves of a test administered only once.
 Sources of error – item sampling and nature of split.
 Split-half method – We treat the test as if it consists of two equivalent
halves: take the scores on one half of the test and correlate them with the
other half of the test. We assume the two halves of the test are equivalent.
The questions are randomly chosen because a group of questions may be
too easy or too hard; we need a larger pool, so one way to split the test is
odd-numbered items vs. even-numbered items. The r11 estimate is only affected randomly by the
sample of questions on one half of the test vs. the other half.
THE SPEARMAN – BROWN FORMULA
The split half method gives us an estimate of reliability for an instrument half
as long as the full test. The longer the test, the more reliable that test will be.
Thus, the Pearson r between two halves of a test will usually underestimate the
reliability of the full instrument. The Spearman – Brown formula is a method for
deriving the r11 of the whole test based on the half – test correlation coefficient –
just need to know this concept.
We DO NOT need to know the formula for this.
Most of these methods are not used alone anymore; more generalized procedures are now used to
gauge the internal consistency of a test. How will we ever know we got the best split on a test for
estimating the r11? We never will.
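For reference only (she said the formula itself is not required), the standard Spearman – Brown correction looks like this:

def spearman_brown(r_half):
    # estimate full-test reliability from the correlation between two half-tests
    return 2 * r_half / (1 + r_half)

print(spearman_brown(0.70))  # ~0.82: the full test is more reliable than either half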
ALPHA FORMULAS
It is a general approach to determining internal consistency, which provides the mean r11 – the average
r11 of all possible split-half combinations.
Cronbach gave us a formula for estimating the internal consistency of a test.
Coefficient Alpha may be thought of as the mean of all possible split – half
coefficients, corrected by the Spearman – Brown formula. Sources of error –
item sampling and test heterogeneity.
If the average is high then we must have test items that are similar to
one another; the individual test items are homogeneous. The test items must
tend to measure a single trait in common, so it will give us a good measure of
this construct. So now we have a high internal consistency. No matter how the
test is split, scores on the two halves almost always correlate to a very high
degree.
The alpha formula gives us an index of the extent to which a test measures a
single trait; with more traits being measured we get a LOWER coefficient
alpha. The coefficient alpha is a function of all the items on the test AND the
total test score. Each item on the test is conceptualized as a test in and of itself.
We try and estimate the r11 of the whole test by looking at the r11 of each
individual test item.
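A minimal sketch of coefficient alpha computed from an item-score matrix, using the standard formula alpha = k/(k-1) × (1 − sum of item variances / variance of total scores); the sample data is made up:

def cronbach_alpha(items):
    # items: one list of scores per item, each of equal length (one score per examinee)
    k, n = len(items), len(items[0])

    def var(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = sum(var(item) for item in items)
    totals = [sum(item[i] for item in items) for i in range(n)]
    return k / (k - 1) * (1 - item_vars / var(totals))

# three items, four examinees (invented scores)
print(cronbach_alpha([[1, 3, 4, 5], [2, 3, 4, 5], [1, 2, 5, 5]]))  # ~0.96, highly homogeneous items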
KUDER – RICHARDSON estimate of R11
There are two more methods for measuring internal consistency. The
formula is known as the Kuder – Richardson formula 20 (KR – 20). Used
mainly for dichotomous scoring; much easier to score. It assumes the items
vary in difficulty. The KR – 20 formula is relevant to the special case where each
test item is scored 0 or 1 (right or wrong). This test is used more often. It is a
more specific form of the coefficient alpha. It shows how much error there is:
one tester interacts with the examinee while another tester does the same but
does not interact with the examinee; she simply collects and observes. When
the test is complete, comparisons of the correlations are made to see if there are
discrepancies in the results. What is the degree of agreement? The KR – 20
will give us an estimate of how much error there is in those scores.
Internal consistency coefficients: the split-half method, the alpha coefficient, and the KR – 20 are all
OBJECTIVELY scored. There are SUBJECTIVELY scored tests in psychology, like IQ tests. For these we use the
INTERSCORER or INTERRATER reliability coefficient. Know TABLE 3.10 on page 97 for the
TEST. She wants us to know what the methods are and when they are appropriate to use. Also know how
many test forms are given and how many sessions.
Inter Scorer Method for estimating Reliability
Some tests have to be subjectively judged, like creative or projective tests. A
sample of tests is independently scored by two or more examiners, and the scores
for pairs of examiners are then correlated. One test given at one session.
Sources of error – scorer differences.
What is a good reliability estimate?
It depends on what you want to do with the data. If you just want to compare
two groups and see if they are different, then you need a reliability coefficient of
at least .65; anything less says the test is not a good measure. But if you are
playing with someone’s career then you need a reliability coefficient of at least
.85. Cognitive tests are more reliable than affective measures. You already
know what you know – 2 times 2 equals 4 – but your mood changes from day to
day.
What are ways to increase score variance without affecting error variance?
 Keep the items on the test of moderate difficulty.
 Add items to a test to make it more reliable. By doing this you are sampling more
of the domain of interest, so we can more finely differentiate among examinees.
 A heterogeneous test group gives more observed score variance. Reliability is a
function of correlation.
AFTER BREAK:
Classical test theory review:
 An observed score is comprised of true scores plus error. X = T + e.
 Reliability is consistency in measurement.
 Need a way to estimate the error – all the methods she just went over.
From the estimation of the error, we will be able to average that value and
use it to determine how close the measured score is to a person’s true
score. This is the goal in testing: estimating error to get to the true score.
Important to remember.
1. The assumption of classical test theory as it relates to error – the error will
average out to a mean of zero in a group of scores.
2. How unsystematic error affects reliability – two sources: sample of questions and
time of administration.
3. The methods of assessing reliability – today’s lecture.
4. How to maximize reliability? Make the test longer, use a heterogeneous test
group, and use questions of moderate difficulty.
STANDARD ERROR OF MEASUREMENT – S.E.M.
How much error variance on a test
(Don’t need to know the equation, just know it exists.)
What does it measure conceptually? Remember the hypothetical bell curve
that says error will be normally distributed if we give the same test over and over
an infinite # of times. The SEM is the standard deviation of this hypothetical
distribution, the mean of which is the person’s true score. It assumes that error is
normally distributed, falling one SD above or below the mean. We use the SEM to
create a confidence interval around the observed score from a given
individual on a given test.
The higher the reliability of a given test, the smaller the SEM, and the
smaller the range within which we would predict an individual’s true score to fall.
Note that there is an inverse relationship.
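A sketch using the standard SEM formula, SEM = SD × √(1 − r11), with illustrative numbers:

import math

sd, r11 = 15.0, 0.91           # e.g., an IQ-type test with reliability .91
sem = sd * math.sqrt(1 - r11)  # 4.5 points
observed = 110
# roughly 68% confidence interval: observed score +/- 1 SEM
print(observed - sem, observed + sem)  # 105.5 115.5
# the inverse relationship: raising r11 to .99 shrinks the SEM to 1.5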
VALIDITY p. 107
Validity is the extent to which a test measures what it claims to measure. It is the most
fundamental and important characteristic of a test. How useful is the test? What
conclusions can be made?
Both unsystematic and systematic error pose a problem. A test can be
reliable without being valid, but it cannot be valid unless it is also reliable.
 Face validity – it looks like it is the “correct test.” An IQ test doesn’t measure
a person’s weight.
 Content validity – how we evaluate the items on the test. The extent to which the
questions on the test represent what we are trying to assess. There
are NO formulas to find this out; we have “experts” in the particular field
look at the tests. The teachers are the experts who evaluate the questions.
Content validity will be important for tests that have a well-defined domain of
content. You can be very specific with questions, but content validity won’t be
good for personality; too many opinions.
Criterion related validity
We make tests to make predictions; this is the most practical use of test scores.
When we use tests to make predictions we need to establish the relationship
between test scores and the behavior we’re interested in.
The correlation coefficient is computed between test
scores and criterion values from an individual. Criterion is the behavior we’re
looking for. There are two types of criterion validity:
1. Concurrent validity – The correlation between test scores and a measure of the
criterion that is available at the time of testing. The goal is to use the
correlation between the test scores and the criterion
to make predictions about individuals whose
criterion measure we don’t yet know. The MMPI does this.
2. Predictive validity – Test scores are used to estimate outcome measures
obtained at a later date. Predictive validity is relevant for entrance exams and
employment tests. We use test scores to predict something that is not available
at the time of testing – we don’t know if a student will get A’s or C’s after he
takes his SAT, we just predict he will based on how well he did. Then
later we correlate college grades with the high school SAT scores to see if
the SAT had good predictive validity.
This value has practical applications when it accurately predicts how a person
will do later in his career or school GPA. It would save a lot of money for
businesses if it were correct.
Standard Error of the Estimate p.113
This is the margin of error to be expected in the predicted criterion score. The
formula helps answer the question, “How accurately can criterion performance
be predicted from test scores.”
Figure 4.4: possible outcomes when a selection test is used to predict
performance on a criterion measure. False positives and false negatives have an
inverse relationship to one another.
 False positives – people who are predicted to succeed, fail.
 False negatives – people who are predicted to fail, succeed.
Factors affecting criterion related validity: Anything that will affect the correlation
coefficient.
 Group differences - We’ll get higher validity coefficients with a more
heterogeneous group because there’s more variance.
 Test length. A longer test produces more variance.
 Criterion contamination.
 Base – rate of criterion – the proportion of successful applicants who would
be selected using current methods, without the benefit of the new test.
 Incremental validity – Use a test as part of a battery, not as the sole predictor. Are we
able to make a more accurate prediction of the criterion by using the
additional information that we get from the test scores?
 Test reliability – necessary but not sufficient for validity.
Construct validity
A theoretical, intangible quality or trait in which individuals differ. Examples
are leadership ability, depression and intelligence. A test designed to measure a
construct must estimate the existence of an inferred underlying characteristic
based on a limited sample of behavior. Does not have a well-defined domain of
content. Possesses two characteristics:
 There is no single criterion to validate the existence of the construct; it cannot
be operationally defined.
 A network of interlocking suppositions can be derived from existing theory
about the construct.
Construct validity – to evaluate the construct validity of a test, we must amass a
variety of evidence from numerous sources. It is supported whenever content,
concurrent and/or predictive validity is supported. This is why construct validity
covers all other validities.
 Expert ratings
 High internal consistency coefficients – supports construct validity.
 Experimental and observational research – Also support construct validity, if
research validates the relationship between test scores and what the theory
about the construct would predict. If cognitive therapy works for depression
then that data supports the construct validity.
 Test performance related interventions
 Interconnections among tests.
Construct validity
There is no external referent sufficient to validate the existence of the construct,
that is, the construct cannot be operationally defined. Nonetheless, a network of
interlocking suppositions can be derived from existing theory about the
construct.
 Convergent validity p. 121, is demonstrated when a test correlates highly
with other variables or tests with which it shares an overlap of constructs.
Divergent validity – demonstrated when we see a pattern of low correlations among
scores on tests that are designed to measure different
constructs. Ex. tests of depression and intelligence should show no correlation
between these two constructs.
Testing 461 /3 – 1 – 99/4th lecture
CHAPTER 4B – THE TESTING PROCESS
Test construction consists of six intertwined stages.
1. Defining the test
2. Selecting a scaling method
3. Constructing the items
4. Testing the items
5. Revising the test
6. Publishing the test
 Defining the test – what’s it going to be used for. The developer must have a
clear idea of what the test is to measure and how it is to differ from existing
tests.
 Scaling method – Assigning #’s to responses so that the examinee can be
judged to have more or less of the characteristic measured.
 Constructing the items on the test – creating each question (items) for the
test.
 Test out those items through item analysis, then…
 Revise the test: reviewing the test, throwing out some items, and revising old
questions. It is a continuous feedback loop until we finally publish the test.
 Publication of the test.
SCALES OF MEASUREMENT
All #’s derived from measurement instruments of any kind can be placed into
one of 4 hierarchical categories: Nominal, Ordinal, Interval and Ratio.
(qualitative and quantitative)
Nominal – the #’s serve only as category labels. Ex. male and female
designated as 1 or 2. (Remember, education was in-between.) A
QUALITATIVE scale.
Ordinal scales – constitute a form of ranking. Ex. military ranks from
private to general. Ex. runners in a race: we don’t know the time
difference between runners crossing the finish line.
Interval scales – provide information about ranking but also supply a
metric for gauging the differences between rankings. No true zero point. Ex. a math
aptitude test.
Ratio scales – have all the characteristics of an interval scale but also
possess a conceptually meaningful zero point at which there is a total
absence of the characteristic being measured. Rare in psychological
measurement. Ex. height, education, blood pressure.
METHODS OF SCALING p. 132
Expert ratings – experts assign a # to reflect a person’s level on the
construct. Basically an ordinal scale.
Method of Equal-Appearing Intervals – a.k.a. Thurstone’s
method. A method of constructing interval-level scales from attitude
statements. Ex. answer true/false to a set of statements; then have
experts look at the statements and determine whether the items are positive or
negative. P. 133.
Guttman Scales – statements form an ordinal sequence; respondents
who endorse one statement are assumed to also
agree with milder statements pertinent to the same underlying continuum.
Likert Scales – a scaling method used to assess attitudes. Also known as
summative scales because the total score is obtained by adding the scores
from individual items. True Likert scales use 5 responses that range from
strongly agree to strongly disagree. But someone could just choose neutral
all the time.
These scaling methods above rely upon the authoritative judgements of experts in the selection and
ordering of items. It is possible to construct measurement scales based entirely on empirical
considerations devoid of theory or expert judgement. This is…
 Empirical/criterion keying – We construct a bunch of items and try them
out and determine what works. Test items are selected for a scale based
entirely on how well they contrast a criterion group from a normative sample.
A common finding is that some items selected for a scale may show no
obvious relationship to the construct being measured. Ex. “I drink a lot of
water” (keyed true) might end up on a Depression scale. It works, so we use
it on the test. From the practical standpoint of empirical test construction,
theoretical explanations of “why” are of secondary importance. If it works,
use it.
Class example: We want to develop a scale of dependency
SCALE OF DEPENDENCY (SOD)
We want it so if a person scores low on our test then that means he’s very independent. If he scores
high then he is very dependent. We need to define the test first. We want it to be brief and easy to
administer, and useful in marital therapy for everyone. A variable like “dependency” should show a bell
curve – we all have some degree of it.
We want 15 items on the test so we start out with twice that number. We give the test to both
dependent and independent people and look at the discrepancy in scores. A question might be “I like to
assert myself” or “I am a leader of a group” – endorsing these would be a sign of independence. We’re just getting an
inventory of items. We want some negatively and some positively keyed items in our survey or else the examinee
might just answer “YES” to everything.
TABLE OF SPECIFICATIONS
Professional developers of achievement and ability tests often use one or more
item – writing schemes to help ensure that their instrument taps a desired
mixture of cognitive processes and content domains. TABLE 4.6 – a matrix
table.
 Behavior objective
 Knowledge of terminology
 Specific facts
 Comprehension
 Application
ITEM FORMATS – one method by which psychological attributes are to be
assessed.
 Multiple-choice questions – allow 45 seconds to one minute for each item.
The wrong answers that are there to throw you off are called “the
distracters.”
 Matching
 Short answer – ex. defining terminology
 True/False. Give them 30 seconds for each question.
 Forced choice format – A triad of answers that are similar. Then the
examinee is asked to pick which statement is the most and least important.
But you cannot say all or none of these are important, so you are forced to
choose.
 Essay – articulate on a question in a writing format. These are subjectively
graded. Give them 6 minutes to answer.
 BLIND GRADING – means that the examinee does not write their name on
the test, so the teacher can grade the test objectively without knowing the
examinee’s name.
ITEM ANALYSIS – Determining whether a test question should be kept,
thrown out, or revised. It is hard to write good items.
It can tell us what the test takers know and do not know and how well the items
contribute to predicting criterion behavior.
Item validity index: Assesses the extent to which test items relate to some
external measurement of criterion. For many applications it is important that a
test possess the highest possible concurrent or predictive validity.
The index is a function of the correlation between the item score and the external criterion
measure. Item scores are measured dichotomously (right or wrong)
and the external criterion measure might be a supervisor’s rating of good
or bad.
There has to be some kind of variability in the item; not everyone should get
it right, and not everyone wrong. For it to be a good test item, some should get it right and some
wrong. In order to calculate this correlation we use a POINT BISERIAL
CORRELATION COEFFICIENT, rpb.
It compares the average criterion score of those who passed the item with the average criterion score
of everyone else who took the exam. The closer to one, the better the
predictive validity of the item. It also depends on the standard deviation, so
the item validity index is the product of the SD and the point biserial
correlation.
Item – reliability index
Inter-item correlation – items that intercorrelate with one another collectively give
us a good measure of whatever is being assessed. Tests that are internally
consistent are tests that have homogeneous items, items that intercorrelate with
one another.
A test developer may desire an instrument with a high level of internal
consistency in which the items are reasonably homogenous. Individual items
are typically right or wrong whereas the total scores constitute a continuous
variable. The point biserial coefficient is used. The closer to one, the better the
predictive variability of the item.
OPTIMAL VALUE WHEN 50% GET IT RIGHT AND 50% WRONG
BUT…
It looks bad when the 50% who got it right were the low scorers on the test.
Homework #8 item analysis
 Item-discrimination index – called “d”. Used to tell whether the items are capable
of discriminating between high and low scorers. The range is from –1 to +1. A
score of 1.0 means all the people in the top 27% got the answer correct and
none of the bottom 27% did – a perfect discriminator item, a perfect question
to use. Teacher says .3 or above is good; .2 – .29 means we should revise the
item; less than .19 means the item should be discarded or completely revised.
 Item difficulty index – called “p”. p stands for percentage/proportion.
Assesses the proportion of the examinees that got the answer correct. We
accept .5 +/- .2 as a good range. We want maximum variance on the items.
So .3 to .7 is acceptable.
 Optimal difficulty – if there are four options on the test then the chance level will be
25%, so the equation will be (1 + .25)/2 = .63. So if about 2/3 get the item correct,
that’s a cool item.
The lower bound for acceptable difficulty is a function of the number (k) of options per
item and the # of people taking the test.
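A toy sketch of p and d in Python (the response vector is invented; examinees are pre-sorted from highest to lowest total score, and the top/bottom 27% groups are used):

# 1 = answered correctly, 0 = answered wrong
responses = [1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

p = sum(responses) / len(responses)       # difficulty: proportion correct
k = max(1, round(len(responses) * 0.27))  # size of the 27% groups
d = sum(responses[:k]) / k - sum(responses[-k:]) / k  # upper minus lower proportion
print(p, d)  # p = 0.6 (within .3-.7), d = 1.0 (a perfect discriminator)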
Item Validity – How well does each preliminary test item contribute to the
accurate prediction of the criterion. The item validity index is used to identify
predictively useful test items. Item validity = SD * point biserial correlation.
Testing 461
3 – 8 – 99
5th lecture & last lecture before first test
REGRESSION & PREDICTION
How we use a known correlation between two variables to predict one
variable from what we know about the other.
Explain how we use correlation to determine the predictive validity of a test.
Predictions are made by using a regression formula. The correlation between
the test and the criterion is incorporated into that formula.
We’re using our scatter plot data we did for our age and grades.
How well does age predict GPA? Is it that the older you get the smarter you get, or
the reverse? We need to first come up with a null hypothesis. What would the
null hypothesis be with regard to the relationship between age and GPA? What
would we be trying to reject? That there is no relationship between these two
variables.
Alternate hypothesis – the correlation is not zero. Or we could predict a
directional relationship.
Our GPA is a bell curve. 2.9 is the mode, so we have a nice distribution. But the
age distribution is positively skewed. Most of the students are in their 20’s.
Variable GPA
 Mean = 3
 Std. Deviation. = .45
 Variance = .2
 Kurtosis = -.55
 Skewness = .12
 Range = 1.80
 Minimum = 2.10
 Maximum = 3.9
Variable Age
 Mean = 330.46
 Std. Deviation = 94.03
 Variance = 884.43
 Skewness = 1.77 (anything above 1.3 is highly skewed)
A perfect correlation between these two variables would be a single straight line
through the data points. A perfect +1 correlation. But we don’t have that. The
correlation between two variables is equal to the slope of the best-fitting line for that
set of data (in standard-score units). Here we are dealing with raw data.
The regression line – the r is just a by-product of finding the best-fitting line
through a set of data points. That line is the regression line and shows how the
variables are related to each other. When a single variable is used to predict
another we call it simple linear regression. If we use more than one variable to
make a prediction we call it multiple regression. Advanced statistics deal with
multiple regression.
Every point that the regression line is drawn through represents the interception of
a predicted criterion value of Y for a given value of X. For our data, those points
represent a predicted GPA for a given age.
Y′ = a + bX
When we restrict the range of a variable we restrict the range of prediction.
Restricting is when we don’t count some of the skewed scores.
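A minimal sketch of fitting Y′ = a + bX by least squares in Python (the age/GPA numbers are invented, not the class data):

def fit_line(xs, ys):
    # least-squares slope b and intercept a for y regressed on x
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

ages = [20, 22, 25, 30, 35, 40]          # predictor (x)
gpas = [2.8, 3.0, 2.9, 3.2, 3.1, 3.4]    # criterion (y)
a, b = fit_line(ages, gpas)
print(a + b * 28)  # predicted GPA for a hypothetical 28-year-old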
Interpreting the correlation coefficient.
We are testing for the null hypothesis, that the population correlation is zero.
Therefore we need to know if the correlation we have obtained is significantly
different from zero.
Consider the magnitude. The higher the value of the r, the more likely it is to
be significant; .3 is not that high. We construct a 95% confidence interval
around a population correlation of zero, then
determine whether our value lies outside the range of 95% of the samples
drawn from a population where there is no relationship between the
variables.
We want to know if the correlation we obtained would be one that we would
get less than 5% of the time by chance if there really were no relationship, so
we set our alpha at .05. We need to know the SD of rho, which is a function
of sample size.
We need to know the relationship between sample size and statistical
significance. Larger samples will show the correlation coefficient to be
significant at lower values than will smaller samples. With larger samples it’s
much easier to get a value that we can reject the null with. Statistical
significance is partially a function of sample size. Statistical significance does
not necessarily equate to practical significance. They are not the same thing all
the time. It doesn’t mean you’re going to explain a lot of the criterion variable.
The coefficient of determination tells us about practical significance. By squaring the
correlation we can find out how much of the variance in y can be explained by knowing x. On
our data plot, r = .3; squaring r we get .09, so 9% of the variance in GPA can be
explained by the age scores. So age is a lousy predictor. It would be
more helpful if we used age with other variables to predict GPA.
When we use multiple predictors to predict a criterion, we talk about
the incremental validity of a given predictor. Adding other variables to age in our
regression formula works if it explains more of the variance in GPA. How
does this work? There’s always going to be error so we have to use the
STANDARD ERROR OF THE ESTIMATE to see how much we are in error. It is
an index of the average amount of error that there is in predicting criterion
scores.
Don’t get the standard error of the estimate mixed up with the standard error of
measurement, which is an index of the average amount of error in
an individual’s obtained score and is based on test reliability.
The standard error of the estimate is an estimate of error in predicted scores – not
in scores actually obtained by a person, but an index of the amount of
error in the predictions we’re making – and it’s based on the predictive
validity coefficient between test scores and criterion scores.
The standard error of the estimate is used to tell us how wrong we are likely to
be by providing an average index. Assuming error is normally distributed, Y′ is
just an estimate of the criterion score that an individual could obtain.
For any GPA that I predict for a person of a given age, I can use that
predicted GPA as the mean of a hypothetical distribution of GPAs that people of that
age actually obtained. Then I can use the standard error of the estimate to
create confidence intervals around the predicted criterion values. For any given
X score we can view the predicted Y as the mean of a normal distribution of Y
scores that someone with that same X score may have obtained.
In our data plot we have a huge amount of error because of our small predictive
validity. There is an inverse relationship between the standard error of the
estimate and the predictive validity coefficient. The higher the correlation
between the test and the criterion, the lower the standard error of the estimate.
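A sketch of that inverse relationship using the standard formula SEE = SD_y × √(1 − r²) (the numbers are illustrative):

import math

def see(sd_y, r):
    return sd_y * math.sqrt(1 - r ** 2)

sd_gpa = 0.45
print(see(sd_gpa, 0.3))  # ~0.43: nearly as large as SD_y itself, so prediction is poor
print(see(sd_gpa, 0.9))  # ~0.20: higher predictive validity, smaller error of estimate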
Research PAPER
 Section 1: tell about the construct and how we came to assess it. Are there
tests developed for this construct? Few or a lot? When and
by whom were they developed? How far has psychology come in addressing this
construct?
 Section 2, explain the most important type of validity that would be necessary
to make a test of this construct useful.
 Section 3: pick a journal article that describes this construct. Look at the
handout. Also attach the journal article to the back of the report. You must have at least
5 references for the paper. You can use our text, the MMY (NOT Tests in Print),
and at least 2 journal articles. Newspaper articles too.
Label each section 1, 2, and 3. Provide citations where appropriate. Limit direct
quotes (don’t do any). List references and include the copy of the journal articles.
DUE APRIL 12TH.