Validity/Reliability and a
recap of Statistics
RCS 6740
6/27/05
RELIABILITY AND VALIDITY
Reliability
From the perspective of classical test theory, an
examinee's obtained test score (X) is composed of
two components, a true score component (T) and
an error component (E):
X=T+E
The true score component reflects the examinee's
status with regard to the attribute that is
measured by the test, while the error component
represents measurement error. Measurement error
is random error. It is due to factors that are
irrelevant to what is being measured by the test
and that have an unpredictable (unsystematic)
effect on an examinee's test score.
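This decomposition can be illustrated with a short simulation (a sketch, not part of the lecture; the means, SDs, and sample size are arbitrary assumptions). Because T and E are independent, the variance of obtained scores is the sum of true score variance and error variance, which anticipates the definition of reliability given below:

    import numpy as np

    rng = np.random.default_rng(0)
    T = rng.normal(50, 10, 100_000)   # hypothetical true scores
    E = rng.normal(0, 4, 100_000)     # random measurement error (mean 0)
    X = T + E                         # obtained scores: X = T + E

    # T and E are independent, so Var(X) is approximately Var(T) + Var(E)
    print(round(X.var(), 1), round(T.var() + E.var(), 1))
    # The ratio of true score variance to obtained variance (about .86
    # here) is the quantity a reliability coefficient estimates
    print(round(T.var() / X.var(), 2))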
RELIABILITY AND VALIDITY
The score you obtain on a test is likely to be due both to the
knowledge you have about the topics addressed by exam
items (T) and the effects of random factors (E) such as the
way test items are written, any alterations in anxiety,
attention, or motivation you experience while taking the test,
and the accuracy of your "educated guesses."
Whenever we administer a test to examinees, we would like
to know how much of their scores reflects "truth" and how
much reflects error. It is a measure of reliability that provides
us with an estimate of the proportion of variability in
examinees' obtained scores that is due to true differences
among examinees on the attribute(s) measured by the test.
RELIABILITY AND VALIDITY
Reliability
When a test is reliable, it provides dependable,
consistent results and, for this reason, the term
consistency is often given as a synonym for
reliability (e.g., Anastasi, 1988).
RELIABILITY AND VALIDITY
The Reliability Coefficient
Ideally, a test's reliability would be calculated by dividing true
score variance by the obtained (total) variance to derive a
reliability index. This index would indicate the proportion of
observed variability in test scores that reflects true score
variability. A test's true score variance is not known, however,
and reliability must be estimated rather than calculated
directly. There are several ways to estimate a test's reliability.
Each involves assessing the consistency of an examinee's
scores over time, across different content samples, or across
different scorers and is based on the assumption that
variability that is consistent is true score variability, while
variability that is inconsistent reflects random error.
RELIABILITY AND VALIDITY
Most methods for estimating reliability produce a reliability
coefficient, which is a correlation coefficient that ranges in
value from 0.0 to +1.0. When a test's reliability coefficient is
0.0, this means that all variability in obtained test scores is
due to measurement error. Conversely, when a test's reliability
coefficient is +1.0, this indicates that all variability in scores
reflects true score variability. The reliability coefficient is
symbolized with the letter "r" and a subscript that contains
two of the same letters or numbers (e.g., "rxx"). The subscript
indicates that the correlation coefficient was calculated by
correlating a test with itself rather than with some other
measure.
RELIABILITY AND VALIDITY
Regardless of the method used to calculate a reliability
coefficient, the coefficient is interpreted directly as the
proportion of variability in obtained test scores that reflects
true score variability. For example, as depicted in Figure 1, a
reliability coefficient of .84 indicates that 84% of variability in
scores is due to true score differences among examinees,
while the remaining 16% (1.00 - .84) is due to measurement
error.
[Figure 1. Proportion of variability in test scores: true score variability (84%), error (16%).]
RELIABILITY AND VALIDITY
Note that a reliability coefficient does not provide
any information about what is actually being
measured by a test. A reliability coefficient only
indicates whether the attribute measured by the
test— whatever it is—is being assessed in a
consistent, precise way. Whether the test is actually
assessing what it was designed to measure is
addressed by an analysis of the test's validity.
RELIABILITY AND VALIDITY
Study Tip: Remember that, in contrast to other correlation
coefficients, the reliability coefficient is never squared to
interpret it but is interpreted directly as a measure of true
score variability. A reliability coefficient of .89 means that
89% of variability in obtained scores is true score variability.
RELIABILITY AND VALIDITY
Methods for Estimating Reliability
The selection of a method for estimating reliability
depends on the nature of the test. As noted below,
each method not only entails different procedures
but is also affected by different sources of error. For
many tests, more than one method should be used.
RELIABILITY AND VALIDITY
1. Test-Retest Reliability: The test-retest method for
estimating reliability involves administering the same test to
the same group of examinees on two different occasions and
then correlating the two sets of scores. When using this
method, the reliability coefficient indicates the degree of
stability (consistency) of examinees' scores over time and is
also known as the coefficient of stability.
The primary sources of measurement error for test-retest
reliability are any random factors related to the time that
passes between the two administrations of the test. These
time sampling factors include random fluctuations in
examinees over time (e.g., changes in anxiety or motivation)
and random variations in the testing situation. Memory and
practice also contribute to error when they have random
carryover effects; i.e., when they affect many or all examinees
but not in the same way.
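As a minimal sketch (the scores below are hypothetical, not from the lecture), the coefficient of stability is simply the Pearson correlation between the two administrations:

    import numpy as np

    # Hypothetical scores for the same five examinees on two occasions
    time1 = np.array([80, 72, 65, 90, 58])
    time2 = np.array([78, 75, 62, 88, 60])

    # Test-retest reliability: correlate the two sets of scores
    r_tt = np.corrcoef(time1, time2)[0, 1]
    print(round(r_tt, 3))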
RELIABILITY AND VALIDITY
Test-retest reliability is appropriate for determining the
reliability of tests designed to measure attributes that are
relatively stable over time and that are not affected by
repeated measurement. It would be appropriate for a test of
aptitude, which is a stable characteristic, but not for a test of
mood, since mood fluctuates over time, or a test of creativity,
which might be affected by previous exposure to test items.
RELIABILITY AND VALIDITY
2. Alternate (Equivalent, Parallel) Forms Reliability: To assess
a test's alternate forms reliability, two equivalent forms of the
test are administered to the same group of examinees and
the two sets of scores are correlated. Alternate forms
reliability indicates the consistency of responding to different
item samples (the two test forms) and, when the forms are
administered at different times, the consistency of
responding over time. The alternate forms reliability
coefficient is also called the coefficient of equivalence when
the two forms are administered at about the same time and
the coefficient of equivalence and stability when a relatively
long period of time separates administration of the two
forms.
RELIABILITY AND VALIDITY
The primary source of measurement error for alternate forms
reliability is content sampling, or error introduced by an
interaction between different examinees' knowledge and the
different content assessed by the items included in the two
forms: The items in Form A might be a better match of one
examinee's knowledge than items in Form B, while the
opposite is true for another examinee. In this situation, the
two scores obtained by each examinee will differ, which will
lower the alternate forms reliability coefficient. When
administration of the two forms is separated by a period of
time, time sampling factors also contribute to error.
RELIABILITY AND VALIDITY
Like test-retest reliability, alternate forms reliability is not
appropriate when the attribute measured by the test is likely
to fluctuate over time (and the forms will be administered at
different times) or when scores are likely to be affected by
repeated measurement. If the same strategies required to
solve problems on Form A are used to solve problems on
Form B, even if the problems on the two forms are not
identical, there are likely to be practice effects. When these
effects differ for different examinees (i.e., are random),
practice will serve as a source of measurement error.
Although alternate forms reliability is considered by some
experts to be the most rigorous (and best) method for
estimating reliability, it is not often assessed due to the
difficulty in developing forms that are truly equivalent.
RELIABILITY AND VALIDITY
3. Internal Consistency Reliability: Reliability can also be
estimated by measuring the internal consistency of a test.
Split-half reliability and coefficient alpha are two methods for
evaluating internal consistency. Both involve administering
the test once to a single group of examinees, and both yield
a reliability coefficient that is also known as the coefficient of
internal consistency.
To determine a test's split-half reliability, the test is split into
equal halves so that each examinee has two scores (one for
each half of the test). Scores on the two halves are then
correlated. Tests can be split in several ways, but probably
the most common way is to divide the test on the basis of
odd- versus even-numbered items.
RELIABILITY AND VALIDITY
A problem with the split-half method is that it produces a reliability coefficient that is
based on test scores that were derived from one-half of the entire length of the
test. If a test contains 30 items, each score is based on 15 items. Because
reliability tends to decrease as the length of a test decreases, the split-half
reliability coefficient usually underestimates a test's true reliability. For this reason,
the split-half reliability coefficient is ordinarily corrected using the Spearman-Brown
prophecy formula, which provides an estimate of what the reliability coefficient
would have been had it been based on the full length of the test.
Cronbach's coefficient alpha also involves administering the test once to a single
group of examinees. However, rather than splitting the test in half, a special
formula is used to determine the average degree of inter-item consistency. One
way to interpret coefficient alpha is as the average reliability that would be
obtained from all possible splits of the test. Coefficient alpha tends to be
conservative and can be considered the lower boundary of a test's reliability
(Novick and Lewis, 1967). When test items are scored dichotomously (right or
wrong), a variation of coefficient alpha known as the Kuder-Richardson Formula 20
(KR-20) can be used.
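Both computations can be sketched on a hypothetical matrix of dichotomously scored items (all data below are simulated; the odd-even split is used, and alpha is computed from the standard formula, which reduces to KR-20 for 0/1 items):

    import numpy as np

    rng = np.random.default_rng(1)
    # Simulate 50 examinees x 30 dichotomous (0/1) items that share a
    # common ability factor so the items are positively correlated
    ability = rng.normal(0, 1, (50, 1))
    items = ((ability + rng.normal(0, 1, (50, 30))) > 0).astype(float)

    # Split-half: correlate odd-item and even-item half scores, then
    # apply the Spearman-Brown correction for the full test length
    odd = items[:, 0::2].sum(axis=1)
    even = items[:, 1::2].sum(axis=1)
    r_half = np.corrcoef(odd, even)[0, 1]
    r_sb = 2 * r_half / (1 + r_half)

    # Coefficient alpha (equal to KR-20 for dichotomous items)
    k = items.shape[1]
    alpha = (k / (k - 1)) * (1 - items.var(axis=0, ddof=1).sum()
                             / items.sum(axis=1).var(ddof=1))
    print(round(r_sb, 2), round(alpha, 2))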
RELIABILITY AND VALIDITY
Content sampling is a source of error for both split-half reliability
and coefficient alpha. For split-half reliability, content sampling
refers to the error resulting from differences between the content of
the two halves of the test (i.e., the items included in one half may
better fit the knowledge of some examinees than items in the other
half); for coefficient alpha, content (item) sampling refers to
differences between individual test items rather than between test
halves. Coefficient alpha also has as a source of error, the
heterogeneity of the content domain. A test is heterogeneous with
regard to content domain when its items measure several different
domains of knowledge or behavior. The greater the heterogeneity of
the content domain, the lower the inter-item correlations and the
lower the magnitude of coefficient alpha. Coefficient alpha could be
expected to be smaller for a 200-item test that contains items
assessing knowledge of test construction, statistics, ethics,
industrial-organizational psychology, clinical psychology, etc. than
for a 200-item test that contains questions on test construction
only.
RELIABILITY AND VALIDITY
The methods for assessing internal consistency reliability are
useful when a test is designed to measure a single
characteristic, when the characteristic measured by the test
fluctuates over time, or when scores are likely to be affected
by repeated exposure to the test. They are not appropriate
for assessing the reliability of speed tests because, for these
tests, they tend to produce spuriously high coefficients. (For
speed tests, alternate forms reliability is usually the best
choice.)
RELIABILITY AND VALIDITY
4. Inter-Rater (Inter-Scorer, Inter-Observer) Reliability: Inter-rater
reliability is of concern whenever test scores depend on a rater's
judgment. A test constructor would want to make sure that an
essay test, a behavioral observation scale, or a projective
personality test has adequate inter-rater reliability. This type of
reliability is assessed either by calculating a correlation coefficient
(e.g., a kappa coefficient or coefficient of concordance) or by
determining the percent agreement between two or more raters.
Although the latter technique is frequently used, it can lead to
erroneous conclusions since it does not take into account the level
of agreement that would have occurred by chance alone. This is a
particular problem for behavioral observation scales that require
raters to record the frequency of a specific behavior. In this
situation, the degree of chance agreement is high whenever the
behavior has a high rate of occurrence, and percent agreement will
provide an inflated estimate of the measure's reliability.
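The difference between percent agreement and a chance-corrected index can be sketched with hypothetical ratings (Cohen's kappa is used here as the chance-corrected coefficient):

    import numpy as np

    # Hypothetical dichotomous ratings by two raters for ten subjects
    r1 = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
    r2 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 1])

    # Percent agreement ignores agreement expected by chance
    p_obs = np.mean(r1 == r2)

    # Kappa subtracts the agreement expected by chance alone
    cats = np.unique(np.concatenate([r1, r2]))
    p_chance = sum(np.mean(r1 == c) * np.mean(r2 == c) for c in cats)
    kappa = (p_obs - p_chance) / (1 - p_chance)
    print(p_obs, round(kappa, 2))   # .8 agreement, but kappa is only .52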
RELIABILITY AND VALIDITY
Sources of error for inter-rater reliability include factors related to
the raters such as lack of motivation and rater biases and
characteristics of the measuring device. An inter-rater reliability
coefficient is likely to be low, for instance, when rating categories
are not exhaustive (i.e., don't include all possible responses or
behaviors) and/or are not mutually exclusive.
The inter-rater reliability of a behavioral rating scale can also
be affected by consensual observer drift, which occurs when
two (or more) observers working together influence each
other's ratings so that they both assign ratings in a similarly
idiosyncratic way. (Observer drift can also affect a single
observer's ratings when he or she assigns ratings in a
consistently deviant way.) Unlike other sources of error,
consensual observer drift tends to artificially inflate inter-rater
reliability.
RELIABILITY AND VALIDITY
The reliability (and validity) of ratings can be improved in
several ways. Consensual observer drift can be eliminated by
having raters work independently or by alternating raters.
Rating accuracy is also improved when raters are told that
their ratings will be checked. Overall, the best way to
improve both inter- and intra-rater accuracy is to provide
raters with training that emphasizes the distinction between
observation and interpretation (Aiken, 1985).
RELIABILITY AND VALIDITY
Study Tip: Remember the Spearman-Brown formula is related
to split-half reliability and KR-20 is related to the coefficient
alpha. Also know that alternate forms reliability is the most
thorough method for estimating reliability and that internal
consistency reliability is not appropriate for speed tests.
RELIABILITY AND VALIDITY
Factors That Affect The Reliability Coefficient
The magnitude of the reliability coefficient is affected not only
by the sources of error discussed above but also by the
length of the test, the range of the test scores, and the
probability that the correct response to items can be selected
by guessing.
• Test Length
• Range of Test Scores
• Guessing
RELIABILITY AND VALIDITY
1. Test Length: The larger the sample of the attribute being measured
by a test, the less the relative effects of measurement error and the
more likely the sample will provide dependable, consistent
information. Consequently, a general rule is that the longer the test,
the larger the test's reliability coefficient.
The Spearman-Brown prophecy formula is most associated with
split-half reliability but can actually be used whenever a test
developer wants to estimate the effects of lengthening or shortening
a test on its reliability coefficient. For instance, if a 100-item test
has a reliability coefficient of .84, the Spearman-Brown formula
could be used to estimate the effects of increasing the number of
items to 150 or reducing the number to 50. A problem with the
Spearman-Brown formula is that it does not always yield an
accurate estimate of reliability: In general, it tends to overestimate
a test's true reliability (Gay, 1992).
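A sketch of the general Spearman-Brown prophecy formula, applied to the 100-item example above (k is the factor by which test length is multiplied):

    def spearman_brown(r, k):
        """Predicted reliability when test length is multiplied by k."""
        return k * r / (1 + (k - 1) * r)

    # The 100-item test with a reliability of .84 from the example
    print(round(spearman_brown(.84, 1.5), 3))   # 150 items: about .887
    print(round(spearman_brown(.84, 0.5), 3))   # 50 items: about .724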
RELIABILITY AND VALIDITY
This is most likely to be the case when the added items do not measure
the same content domain as the original items and/or are more
susceptible to the effects of measurement error. Note that, when used to
correct the split-half reliability coefficient, the situation is more complex,
and this generalization does not always apply: When the two halves are
not equivalent in terms of their means and standard deviations, the
Spearman-Brown formula may either over- or underestimate the test's
actual reliability.
2. Range of Test Scores: Since the reliability coefficient is a correlation
coefficient, it is maximized when the range of scores is unrestricted. The
range is directly affected by the degree of similarity of examinees with
regard to the attribute measured by the test: When examinees are
heterogeneous, the range of scores is maximized. The range is also
affected by the difficulty level of the test items. When all items are either
very difficult or very easy, all examinees will obtain either low or high
scores, resulting in a restricted range. Therefore, the best strategy is to
choose items so that the average difficulty level is in the mid-range (p =
.50).
RELIABILITY AND VALIDITY
3. Guessing: A test's reliability coefficient is also affected by the
probability that examinees can guess the correct answers to
test items. As the probability of correctly guessing answers
increases, the reliability coefficient decreases. All other things
being equal, a true/false test will have a lower reliability
coefficient than a four-alternative multiple-choice test which,
in turn, will have a lower reliability coefficient than a free
recall test.
RELIABILITY AND VALIDITY
The Interpretation of Reliability
The interpretation of a test's reliability entails considering its
effects on the scores achieved by a group of examinees as
well as the score obtained by a single examinee.
1. The Reliability Coefficient: As discussed above, a reliability
coefficient is interpreted directly as the proportion of
variability in a set of test scores that is attributable to true
score variability. A reliability coefficient of .84 indicates that
84% of variability in test scores is due to true score
differences among examinees, while the remaining 16% is
due to measurement error. While different types of tests can
be expected to have different levels of reliability, for most
tests, reliability coefficients of .80 or larger are considered
acceptable.
RELIABILITY AND VALIDITY
When interpreting a reliability coefficient, it is important to
keep in mind that there is no single index of reliability for a
given test. Instead, a test's reliability coefficient can vary
from situation to situation and sample to sample. Ability
tests, for example, typically have different reliability
coefficients for groups of individuals of different ages or
ability levels.
RELIABILITY AND VALIDITY
2. The Standard Error of Measurement: While the reliability coefficient is
useful for estimating the proportion of true score variability in a set of test
scores, it is not particularly helpful for interpreting an individual
examinee's obtained test score. When an examinee receives a score of 80
on a 100-item test that has a reliability coefficient of .84, for instance, we
can only conclude that, since the test is not perfectly reliable, the
examinee's obtained score might or might not be his or her true score.
A common practice when interpreting an examinee's obtained score is to
construct a confidence interval around that score. The confidence interval
helps a test user estimate the range within which an examinee's true
score is likely to fall given his or her obtained score. This range is
calculated using the standard error of measurement, which is an index of
the amount of error that can be expected in obtained scores due to the
unreliability of the test. (When raw scores have been converted to
percentile ranks, the confidence interval is referred to as a percentile
band.)
RELIABILITY AND VALIDITY
The following formula is used to estimate the
standard error of measurement:
Formula 1: Standard Error of Measurement
SEmeas = SDx × √(1 − rxx)
Where:
SEmeas = standard error of measurement
SDx = standard deviation of test scores
rxx= reliability coefficient
RELIABILITY AND VALIDITY
As shown by the formula, the magnitude of the standard
error is affected by two factors: the standard deviation of the
test scores and the test's reliability coefficient. The lower the
test's standard deviation and the higher its reliability
coefficient, the smaller the standard error of measurement
(and vice versa).
Because the standard error is a type of standard deviation, it
can be interpreted in terms of the areas under the normal
curve. With regard to confidence intervals, this means that a
68% confidence interval is constructed by adding and
subtracting one standard error to an examinee's obtained
score; a 95% confidence interval is constructed by adding
and subtracting two standard errors; and a 99% confidence
interval is constructed by adding and subtracting three
standard errors.
RELIABILITY AND VALIDITY
Example: The psychologist in Study #3 administers the
interpersonal assertiveness test to a sales applicant who receives a
score of 80. Since the test's reliability is less than 1.0, the
psychologist knows that this score might be an imprecise estimate
of the applicant's true score and decides to use the standard error
of measurement to construct a 95% confidence interval. Assuming
that the test's reliability coefficient is .84 and its standard deviation
is 10, the standard error of measurement is equal to 4.0:
SEmeas = SDx × √(1 − rxx) = 10 × √(1 − .84) = 10 × .4 = 4.0
The psychologist constructs the 95% confidence interval by adding
and subtracting two standard errors from the applicant's obtained
score: 80 + 2(4.0) = 72 to 88. This means that there is a 95%
chance that the applicant's true score falls between 72 and 88.
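The same computation expressed as a short sketch (values taken from the example above):

    import math

    def sem(sd, rxx):
        """Standard error of measurement: SD * sqrt(1 - rxx)."""
        return sd * math.sqrt(1 - rxx)

    se = sem(10, .84)                        # 4.0
    lower, upper = 80 - 2 * se, 80 + 2 * se  # 95% interval: +/- 2 SEmeas
    print(se, lower, upper)                  # 4.0 72.0 88.0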
RELIABILITY AND VALIDITY
One problem with the standard error is that measurement
error is not usually equally distributed throughout the range
of test scores. Use of the same standard error to construct
confidence intervals for all scores in a distribution can,
therefore, be somewhat misleading. To overcome this
problem, some test manuals report different standard errors
for different score intervals.
RELIABILITY AND VALIDITY
3. Estimating True Scores from Obtained Scores: As discussed
above, because of the effects of measurement error, obtained
test scores tend to be biased (inaccurate) estimates of true
scores. More specifically, scores above the mean of a
distribution tend to overestimate true scores, while scores
below the mean tend to underestimate true scores. Moreover,
the farther from the mean an obtained score is, the greater
this bias. Rather than constructing a confidence interval, an
alternative (but less used) method for interpreting an
examinee's obtained test score is to estimate his/her true
score using a formula that takes into account this bias by
adjusting the obtained score using the mean of the
distribution and the test's reliability coefficient.
RELIABILITY AND VALIDITY
For example, if an examinee obtains a score of 80 on a test
that has a mean of 70 and a reliability coefficient of .84, the
formula predicts that the examinee's true score is 78.4:
T' = a + bX = (1 − rxx) × M + rxx × X
T' = (1 − .84) × 70 + .84 × 80
= .16 × 70 + .84 × 80
= 11.2 + 67.2 = 78.4
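A minimal sketch of this estimate (often credited to Kelley), which regresses the obtained score toward the mean in proportion to the test's unreliability:

    def estimated_true_score(x, mean, rxx):
        """Kelley's estimate: T' = rxx * X + (1 - rxx) * M."""
        return rxx * x + (1 - rxx) * mean

    print(estimated_true_score(80, 70, .84))   # 78.4, as in the example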
RELIABILITY AND VALIDITY
4. The Reliability of Difference Scores: A test user is sometimes
interested in comparing the performance of an examinee on
two different tests or subtests and, therefore, computes a
difference score. An educational psychologist, for instance,
might calculate the difference between a child's WISC-R
Verbal and Performance IQ scores. When doing so, it is
important to keep in mind that the reliability coefficient for
the difference scores can be no larger than the average of the
reliabilities of the two tests or subtests: If Test A has a
reliability coefficient of .95 and Test B has a reliability
coefficient of .85, this means that difference scores calculated
from the two tests will have a reliability coefficient of .90 or
less. The exact size of the reliability coefficient for difference
scores depends on the degree of correlation between the two
tests: The more highly correlated the tests, the smaller the
reliability coefficient (and the larger the standard error of
measurement).
RELIABILITY AND VALIDITY
Validity
Validity refers to a test's accuracy. A test is valid when it measures
what it is intended to measure. The intended uses for most tests fall
into one of three categories, and each category is associated with a
different method for establishing validity:
• The test is used to obtain information about an examinee's
familiarity with a particular content or behavior domain: content
validity.
• The test is administered to determine the extent to which an
examinee possesses a particular hypothetical trait: construct
validity.
• The test is used to estimate or predict an examinee's standing or
performance on an external criterion: criterion-related validity.
RELIABILITY AND VALIDITY
For some tests, it is necessary to demonstrate only one type of
validity; for others, it is desirable to establish more than one
type. For example, if an arithmetic achievement test will be
used to assess the classroom learning of 8th grade students,
establishing the test's content validity would be sufficient. If
the same test will be used to predict the performance of 8th
grade students in an advanced high school math class, the
test's content and criterion-related validity will both be of
concern.
Note that, even when a test is found valid for a particular
purpose, it might not be valid for that purpose for all people. It
is quite possible for a test to be a valid measure of intelligence
or a valid predictor of job performance for one group of people
but not for another group.
RELIABILITY AND VALIDITY
Content Validity
A test has content validity to the extent that it adequately samples the content or
behavior domain that it is designed to measure. If test items are not a good
sample, results of testing will be misleading. Although content validation is
sometimes used to establish the validity of personality, aptitude, and attitude tests,
it is most associated with achievement-type tests that measure knowledge of one
or more content domains and with tests designed to assess a well-defined behavior
domain. Adequate content validity would be important for a statistics test and for a
work (job) sample test.
Content validity is usually "built into" a test as it is constructed through a
systematic, logical, and qualitative process that involves clearly identifying the
content or behavior domain to be sampled and then writing or selecting items that
represent that domain. Once a test has been developed, the establishment of
content validity relies primarily on the judgment of subject matter experts. If
experts agree that test items are an adequate and representative sample of the
target domain, then the test is said to have content validity.
RELIABILITY AND VALIDITY
Although content validation depends mainly on the judgment of experts,
supplemental quantitative evidence can be obtained. If a test has
adequate content validity, a coefficient of internal consistency will be
large; the test will correlate highly with other tests that purport to
measure the same domain; and pre-/post-test evaluations of a program
designed to increase familiarity with the domain will indicate appropriate
changes.
Content validity must not be confused with face validity. Content validity
refers to the systematic evaluation of a test by experts who determine
whether or not test items adequately sample the relevant domain, while
face validity refers simply to whether or not a test "looks like" it measures
what it is intended to measure. Although face validity is not an actual type
of validity, it is a desirable feature for many tests. If a test lacks face
validity, examinees may not be motivated to respond to items in an
honest or accurate manner. A high degree of face validity does not,
however, indicate that a test has content validity.
RELIABILITY AND VALIDITY
Construct Validity
When a test has been found to measure the hypothetical trait
(construct) it is intended to measure, the test is said to have
construct validity. A construct is an abstract characteristic
that cannot be observed directly but must be inferred by
observing its effects. Intelligence, mechanical aptitude, self-esteem,
and neuroticism are all constructs.
There is no single way to establish a test's construct validity.
Instead, construct validation entails a systematic
accumulation of evidence showing that the test actually
measures the construct it was designed to measure. The
various methods used to establish this type of validity each
answer a slightly different question about the construct and
include the following:
RELIABILITY AND VALIDITY
• Assessing the test's internal consistency: Do scores on
individual test items correlate highly with the total test
score; i.e., are all of the test items measuring the same
construct?
• Studying group differences: Do scores on the test
accurately distinguish between people who are known to
have different levels of the construct?
• Conducting research to test hypotheses about the
construct: Do test scores change, following an
experimental manipulation, in the direction predicted by
the theory underlying the construct?
RELIABILITY AND VALIDITY
• Assessing the test's convergent and discriminant validity: Does
the test have high correlations with measures of the same trait
(convergent validity) and low correlations with measures of
unrelated traits (discriminant validity)?
• Assessing the test's factorial validity: Does the test have the
factorial composition it would be expected to have; i.e., does it
have factorial validity?
RELIABILITY AND VALIDITY
Construct validity is said to be the most theory-laden of the
methods of test validation. The developer of a test designed
to measure a construct begins with a theory about the nature
of the construct, which then guides the test developer in
selecting test items and in choosing the methods for
establishing the test's validity. For example, if the developer
of a creativity test believes that creativity is unrelated to
general intelligence, that creativity is an innate characteristic
that cannot be learned, and that creative people can be
expected to generate more alternative solutions to certain
types of problems than non-creative people, she would want
to determine the correlation between scores on the creativity
test and a measure of intelligence, see if a course in
creativity affects test scores, and find out if test scores
distinguish between people who differ in the number of
solutions they generate to relevant problems.
RELIABILITY AND VALIDITY
Note that some experts consider construct validity
to be the most basic form of validity because the
techniques involved in establishing construct
validity overlap those used to determine if a test
has content or criterion-related validity. Indeed,
Cronbach argues that "all validation is one, and in
a sense all is construct validation."
RELIABILITY AND VALIDITY
Construct Validity
1. Convergent and Discriminant Validity: As noted
above, one way to assess a test's construct validity is
to correlate test scores with scores on measures that
do and do not purport to assess the same trait. High
correlations with measures of the same trait provide
evidence of the test's convergent validity, while low
correlations with measures of unrelated
characteristics provide evidence of the test's
discriminant (divergent) validity.
RELIABILITY AND VALIDITY
The multitrait-multimethod matrix (Campbell & Fiske, 1959)
is used to systematically organize the data collected when
assessing a test's convergent and discriminant validity. The
multitrait-multimethod matrix is a table of correlation
coefficients, and, as its name suggests, it provides
information about the degree of association between two or
more traits that have each been assessed using two or more
methods. When the correlations between different methods
measuring the same trait are larger than the correlations
between the same and different methods measuring different
traits, the matrix provides evidence of the test's convergent
and discriminant validity.
RELIABILITY AND VALIDITY
Example: To assess the construct validity of the
interpersonal assertiveness test, the psychologist in Study
#3 administers four measures to a group of salespeople: (1)
the test of interpersonal assertiveness; (2) a supervisor's
rating of interpersonal assertiveness; (3) a test of
aggressiveness; and (4) a supervisor's rating of
aggressiveness. The psychologist has the minimum data
needed to construct a multitrait-multimethod matrix: She
has measured two traits that she believes are unrelated
(assertiveness and aggressiveness), and each trait has been
measured by two different methods (a test and a supervisor's rating). The psychologist calculates correlation coefficients
for all possible pairs of scores on the four measures and
constructs the following multitrait-multimethod matrix (the
upper half of the table has not been filled in because it would
simply duplicate the correlations in the lower half):
RELIABILITY AND VALIDITY
A1 = Assertiveness Test; B1 = Aggressiveness Test;
A2 = Assertiveness Rating; B2 = Aggressiveness Rating

        A1           B1           A2           B2
A1   rA1A1 (.93)
B1   rB1A1 (.13)  rB1B1 (.88)
A2   rA2A1 (.71)  rA2B1 (.03)  rA2A2 (.76)
B2   rB2A1 (.04)  rB2B1 (.71)  rB2A2 (.16)  rB2B2 (.70)
RELIABILITY AND VALIDITY
All multitrait-multimethod matrices contain four
types of correlation coefficients:
• Monotrait-monomethod coefficients ("same trait-same
method")
• Monotrait-heteromethod coefficients ("same trait-different
methods")
• Heterotrait-monomethod coefficients ("different traits-same
method")
• Heterotrait-heteromethod coefficients ("different traits-different
methods")
RELIABILITY AND VALIDITY
1. Monotrait-monomethod coefficients ("same trait-same method"): The monotrait-monomethod
coefficients (coefficients in parentheses in the
above matrix) are reliability coefficients: They
indicate the correlation between a measure and
itself. Although these coefficients are not directly
relevant to a test's convergent and discriminant
validity, they should be large in order for the
matrix to provide useful information.
RELIABILITY AND VALIDITY
2. Monotrait-heteromethod coefficients ("same trait-different methods"): These coefficients
(coefficients in rectangles) indicate the correlation
between different measures of the same trait.
When these coefficients are large, they provide
evidence of convergent validity.
RELIABILITY AND VALIDITY
3. Heterotrait-monomethod coefficients ("different
traits-same method"): These coefficients
(coefficients in ellipses) show the correlation
between different traits that have been measured
by the same method. When the heterotrait-monomethod coefficients are small, this indicates
that a test has discriminant validity.
RELIABILITY AND VALIDITY
4. Heterotrait-heteromethod coefficients ("different
traits-different methods"): The heterotrait-heteromethod coefficients (underlined coefficients)
indicate the correlation between different traits
that have been measured by different methods.
These coefficients also provide evidence of
discriminant validity when they are small.
RELIABILITY AND VALIDITY
Note that, in a multitrait-multimethod matrix, only
those correlation coefficients that include the test
that is being validated are actually of interest. For
the above example, the correlation between the
rating of interpersonal assertiveness and the rating
of aggressiveness (r = .16) is a heterotrait-monomethod coefficient, but it isn't of interest
because it doesn't provide information about the
interpersonal assertiveness test. Also, the number
of correlation coefficients that can provide
evidence of convergent and discriminant validity
depends on the number of measures included in
the matrix. In the example, only four measures
were included (the minimum number), but there
could certainly have been more.
RELIABILITY AND VALIDITY
Example: Three of the correlations in the above multitrait-multimethod matrix are relevant to the construct validity of the
interpersonal assertiveness test. The correlation between the
assertiveness test and the assertiveness rating (monotrait-heteromethod coefficient) is .71. Since this is a relatively high
correlation, it suggests that the test has convergent validity.
The correlation between the assertiveness test and the
aggressiveness test (heterotrait-monomethod coefficient) is .13
and the correlation between the assertiveness test and the
aggressiveness rating (heterotrait-heteromethod coefficient) is
.04. Because these two correlations are low, they confirm that
the assertiveness test has discriminant validity. This pattern of
correlation coefficients confirms that the assertiveness test has
construct validity. Note that the monotrait-monomethod
coefficient for the assertiveness test is .93, which indicates that
the test also has adequate reliability. (The other correlations in
the matrix are not relevant to the psychologist's validation
study because they do not include the assertiveness test.)
RELIABILITY AND VALIDITY
A1 = Assertiveness Test; B1 = Aggressiveness Test;
A2 = Assertiveness Rating; B2 = Aggressiveness Rating

        A1           B1           A2           B2
A1   rA1A1 (.93)
B1   rB1A1 (.13)  rB1B1 (.91)
A2   rA2A1 (.71)  rA2B1 (.09)  rA2A2 (.86)
B2   rB2A1 (.04)  rB2B1 (.68)  rB2A2 (.16)  rB2B2 (.89)
RELIABILITY AND VALIDITY
Construct Validity
2. Factor Analysis: Factor analysis is used for
several reasons including identifying the
minimum number of common factors required to
account for the intercorrelations among a set of
tests or test items, evaluating a test’s internal
consistency, and assessing a test’s construct
validity. When factor analysis is used for the latter
purpose, a test is considered to have construct
(factorial) validity when it correlates highly only
with the factor(s) that it would be expected to
correlate with.
DESCRIPTIVE STATISTICS
Descriptive Statistics
Descriptive statistics are used to describe or summarize a
distribution (set) of data. Descriptive techniques include:
• tables,
• graphs,
• measures of central tendency, and
• measures of variability.
DESCRIPTIVE STATISTICS
A set of data can be organized in a table known as a
frequency distribution. Frequency distributions are
constructed by summarizing the data in terms of the
number (frequency) of observations in each category,
score, or score interval. In Study # 1, the academic
achievement tests scores of 25 children with ADHD could
be summarized as shown in Table 1. The column labeled
"Frequency (f)" indicates the number of observations in
each score interval: Three of the 25 children received a
score between 80 and 100, while five received a score
between 60 and 79.
DESCRIPTIVE STATISTICS
Table 1 also includes a "Cumulative Frequency (cf)" column. The
cumulative frequencies indicate the total number of observations that
fall at or below each category or score. The information in Table 1
indicates that 2 of the 25 children received scores of 19 or below, 5
received scores of 39 or below, and so on.
Table 1
Score     Frequency (f)   Cumulative Frequency (cf)
80-100    3               25
60-79     5               22
40-59     12              17
20-39     3               5
0-19      2               2
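A minimal sketch of how the cumulative frequency column is built, accumulating the interval frequencies from the lowest interval upward (data from Table 1):

    # Interval frequencies from Table 1, listed lowest interval first
    freqs = {"0-19": 2, "20-39": 3, "40-59": 12, "60-79": 5, "80-100": 3}

    cum = 0
    for interval, f in freqs.items():
        cum += f                  # cf = total observations at or below
        print(interval, f, cum)   # prints cf values 2, 5, 17, 22, 25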
DESCRIPTIVE STATISTICS
The information presented in a table can also be
presented in a graph. Bar graphs, histograms,
and frequency polygons are three types of
graphs. The choice of a graph depends on the
scale of measurement: Bar graphs are used when
the data represent a nominal or ordinal scale,
while histograms and frequency polygons are
used with interval or ratio data.
DESCRIPTIVE STATISTICS
Shapes of distribution
• Normal curve (mean, mode, and median fall on the same
point)
• Leptokurtic distribution (more peaked than the normal
curve)
• Platykurtic distribution (flatter than the normal curve)
• Positively skewed distribution (the tail extends toward the
positive side of the distribution; i.e., most of the scores
fall at the lower end): mode < median < mean.
• Negatively skewed distribution (the opposite of the
positively skewed distribution): mean < median < mode.
DESCRIPTIVE STATISTICS
Measures of central tendency
• Mean: the arithmetic average.
• Mode: the most frequently occurring score(s).
• Median: the middle score.
DESCRIPTIVE STATISTICS
Measures of variability
• Range: maximum score − minimum score.
• Variance (mean square): S² = SS/(N − 1) = Σ(X − M)²/(N − 1).
[The denominator is N − 1 for the sample variance because the
sample variance tends to underestimate the population variance;
once the sample mean is fixed, only N − 1 scores are free to vary.]
• Standard deviation: the square root of the variance.
• Normal distribution: M ± 1 SD contains 68.26% of scores;
M ± 2 SD, 95.44%; M ± 3 SD, 99.72%.
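A short sketch of these computations on a small hypothetical sample:

    import math

    scores = [2, 4, 4, 4, 5, 5, 7, 9]          # hypothetical sample
    n = len(scores)
    m = sum(scores) / n                        # mean
    ss = sum((x - m) ** 2 for x in scores)     # sum of squared deviations
    variance = ss / (n - 1)                    # N - 1 for a sample
    sd = math.sqrt(variance)
    print(m, round(variance, 2), round(sd, 2)) # 5.0 4.57 2.14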
DESCRIPTIVE STATISTICS
Effect of mathematical operations on measures of central
tendency and variability: Adding or subtracting a constant
to every score changes the measures of central tendency but
not the measures of variability. Multiplying or dividing by a
constant changes both central tendency and variability.
INFERENTIAL STATISTICS
Inferential Statistics
While descriptive statistics are used to summarize data,
inferential statistics are used to make inferences about a
population based on data collected from a sample drawn
from that population and to do so with a pre-defined
degree of confidence. In this section, the concept of
statistical inference is explained. In Section IV, specific
inferential statistical tests are described.
INFERENTIAL STATISTICS
The Logic of Statistical Inference
The techniques of statistical inference allow an
investigator to make inferences about the relationships
between variables in a population based on relationships
observed in a sample.
INFERENTIAL STATISTICS
For example, the psychologist in Study # 1 will want to
determine if there is a relationship between training in the
self-control procedure and scores on an academic
achievement test for all children who have received a
diagnosis of ADHD. Since the psychologist won't have
access to the entire population of children with this
disorder, he will evaluate the effects of the self-control
procedure on a sample of children drawn from the target
population. The psychologist will then use an inferential
statistical test to analyze the data he collects from the
sample, and results of the test will enable him to make an
inference about the effects of the procedure on the
achievement test scores for the population of children with
ADHD. Inferential statistical tests accomplish this task
through the use of a sampling distribution.
INFERENTIAL STATISTICS
Sampling Distributions
1. Population Parameters and Sample Statistics: To
understand inferential statistics, it is necessary to first
distinguish between sample values and population values.
As noted above, when conducting a research study, an
investigator does not have access to the entire population
of interest but, instead, estimates population values based
on obtained sample values. In other words, an investigator
uses a sample statistic to estimate a population
parameter. Sample statistics and population parameters
are designated with different symbols:
INFERENTIAL STATISTICS
Value                  Population Parameter    Sample Statistic
Mean                   μ                       M
Standard Deviation     σ                       SD
Variance               σ²                      S²
INFERENTIAL STATISTICS
2. Characteristics of Sampling Distributions: Due to the
effects of random (chance) factors, it is unlikely that any
sample will perfectly represent the population from which it
was drawn. As a result, an estimate of a population
parameter from a sample statistic is always subject to
some inaccuracy. Because of the effects of sampling error,
sample statistics deviate from population parameters and
from statistics obtained from other samples drawn from the
same population.
INFERENTIAL STATISTICS
The relationship between sample statistics and a
population parameter can be described in terms of a
sampling distribution, which is a frequency distribution of
the means or other sample values of a very large number
of equal-sized samples that have been randomly selected
from the population. Keep in mind that a sampling
distribution is not a distribution of individual scores but a
distribution of sample statistics. A sampling distribution is
important in inferential statistics because it allows a
researcher to determine the probability that a sample
having a particular mean or other value could have been
drawn from a population with a known parameter.
INFERENTIAL STATISTICS
To better understand what a sampling distribution is, assume that the
psychologist in Study # 1 defines his population as "all children in the
6th grade who have received a diagnosis of ADHD," and, for that
population, an academic achievement test has a mean of 50 and a
standard deviation of 10. The psychologist repeatedly selects random
samples of 25 children from this population; and, for each sample he
administers the achievement test and calculates the mean score. The
psychologist has collected a set of sample means and finds that, while
some of the sample means are equal to the population mean (50),
because of the effects of sampling error, some means are larger than
the population mean and some are smaller. In fact, the psychologist
finds that his distribution of sample means, or sampling distribution of
the mean, resembles the distribution depicted in Figure 7. As shown in
that figure, the sampling distribution of the mean is normally shaped
and its mean is equal to the population mean of 50.
INFERENTIAL STATISTICS
Researchers do not actually construct a sampling
distribution of the mean by obtaining a large number of
samples and calculating each sample's mean. Instead,
they depend on probability theory to tell them what a
sampling distribution would look like. The sampling
distribution defined by probability theory is called a
theoretical sampling distribution, and it is based on the
assumption that an infinite number of equal-sized samples
have been randomly drawn from the same population.
INFERENTIAL STATISTICS
The characteristics of a sampling distribution of the mean
are specified by the Central Limit Theorem, which makes
the following predictions: (a) Regardless of the shape of
the distribution of individual scores in the population, as
the sample size increases, the sampling distribution of the
mean approaches a normal distribution; (b) The mean of
the sampling distribution of the mean is equal to the
population mean; (c) The standard deviation of the
sampling distribution of the mean is equal to the
population standard deviation divided by the square root of
the sample size:
Formula 4: Standard Error of the Mean
SEM = σ/√N
INFERENTIAL STATISTICS
The standard deviation of a sampling distribution of the
mean is known as the standard error of the mean. It
provides an estimate of the extent to which the mean of
any one sample randomly drawn from a population can be
expected to vary from the population mean as the result of
sampling error. In other words, like other standard
deviations, it is a measure of variability, but it is a measure
of variability that is due to the effects of random error. The
formula for SEM indicates that the size of the standard
error of the mean is affected by the population standard
deviation and the sample size (N): The larger the
population standard deviation and the smaller the sample
size, the larger the standard error and vice versa.
INFERENTIAL STATISTICS
For the above example, the population standard deviation
for the achievement test is 10 and the sample size is 25.
Using Formula 4, we can determine that the standard error
of the mean in this situation is equal to 2:
SEM = 10/√25 = 10/5 = 2
For Study # 1, the Central Limit Theorem predicts that the
sampling distribution of the mean is normally shaped, that
its mean is equal to 50, and that its standard deviation is
equal to 2.
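The theoretical sampling distribution can also be approximated by simulation (a sketch; the population values and sample size follow the example, the number of samples is arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    # Draw many random samples of 25 scores from a population with
    # mean 50 and SD 10, and record each sample's mean
    means = rng.normal(50, 10, size=(100_000, 25)).mean(axis=1)

    print(round(means.mean(), 1))        # about 50, the population mean
    print(round(means.std(ddof=1), 2))   # about 2, i.e., 10 / sqrt(25)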
INFERENTIAL STATISTICS
Note that, if the sample size had been 9 instead of 25, the
standard error would increase to 3.33 (10 divided by the
square root of 9 = 10/3 = 3.33). In other words, the smaller
the sample size, the larger the standard error of the mean.
One implication of this is that the smaller the size of the
sample, the greater the probability for error when using a
sample statistic to estimate a population parameter.
Another implication is that, for any given population, there
is a ''family'' of sampling distributions, with a different
distribution for each sample size.
INFERENTIAL STATISTICS
Although this discussion of sampling distributions has focused on the
sampling distribution of the mean, a sampling distribution can actually
be derived for any sample statistic. A sampling distribution can be
obtained for standard deviations, proportions, correlation coefficients,
the difference between means, and so on. In each case, the basic
characteristics of the sampling distribution are similar to those of the
sampling distribution of the mean.
The sampling distribution is the foundation of inferential statistics. It is
the sampling distribution that enables a researcher to make inferences
about the relationships between variables in the population based on
obtained sample data. How this is done is described in the next
section.
INFERENTIAL STATISTICS
Analyzing the Data and Making a Decision:
After stating the null and alternative hypotheses and
collecting the sample data, an investigator analyzes the
data using an inferential statistical test such as the t-test or
analysis of variance. The choice of a statistical test is
based on several factors including the scale of
measurement of the data to be analyzed.
The inferential statistical test yields a t, an F, or other value
that indicates where the obtained sample statistic falls in
the appropriate sampling distribution. That is, the test
indicates whether the statistic is in the rejection region or
the retention region of the sampling distribution:
INFERENTIAL STATISTICS
The rejection region, or "region of unlikely values," lies in
one or both tails of the sampling distribution and contains
the sample values that are most unlikely to occur simply
as the result of sampling error. (The rejection region is
also known as the critical region.)
The retention region, or "region of likely values," lies in the
central portion of the sampling distribution and consists of
the values that are likely to occur as a consequence of
sampling error only.
INFERENTIAL STATISTICS
When the results of the statistical test indicate that the
obtained sample statistic is in the rejection region of the
sampling distribution, the null hypothesis is rejected and
the alternative hypothesis is retained. The investigator
concludes that the sample statistic is not likely to have
been obtained by chance alone and that the independent
variable has had an effect on the dependent variable.
Conversely, if the statistical test indicates that the sample
statistic lies in the retention region of the sampling
distribution, the null hypothesis is retained and the
alternative hypothesis is rejected. In this case, the
investigator concludes that the independent variable has
not had an effect and that any observed effect is due to
error.
INFERENTIAL STATISTICS
Example
In Study # 1, if the children who receive training in the self-control
procedure obtain a mean of 60 on the achievement test following
training, the psychologist would use an inferential statistical test to
determine whether the mean of 60 is due to error or to the procedure.
If the results of the test indicate that a mean of 60 is in the retention
region of the appropriate sampling distribution, the psychologist will
conclude that the procedure does not have an effect and that the
observed effect simply reflects error. Conversely, if the statistical test
indicates that a mean of 60 is in the rejection region, the psychologist
will conclude that the self-control procedure does, in fact, have a
beneficial effect on achievement test scores.
INFERENTIAL STATISTICS
Alpha: The size of the rejection region is defined by alpha
(α), or the level of significance. If alpha is .05, then 5% of
the sampling distribution represents the rejection region
and the remaining 95% represents the retention region.
The rejection region is always placed in one or both tails of
the sampling distribution; that is, in that portion of the
sampling distribution that contains the values that are least
likely to occur as the result of sampling error only. The
value of alpha is set by an experimenter prior to collecting
and/or analyzing the data. In other words, it is the
experimenter who decides what proportion of the sampling
distribution will represent the region of unlikely values. In
psychological research, alpha is commonly set at .05 or
.01.
INFERENTIAL STATISTICS
When the results of an inferential statistical test indicate
that the obtained sample statistic lies in the rejection
region of the sampling distribution, the study's results are
said to be statistically significant. For example, when alpha
has been set at .05 and the statistical test indicates that
the sample value is in the rejection region, the results of
the study are "significant at the .05 level."
INFERENTIAL STATISTICS
One- versus Two-Tailed Tests: Some inferential
statistical tests can be conducted as either a one- or two-tailed test. When a two-tailed test is used, the rejection
region is equally divided between the two tails of the
sampling distribution. If alpha is set at .05, 2.5% of the
rejection region lies in the positive tail of the distribution
and 2.5% lies in the negative tail. With a one-tailed test,
the entire rejection region is placed in only one of the tails.
The division of the sampling distribution for one- and two-tailed tests when alpha has been set at .05 is illustrated in
the following figure.
INFERENTIAL STATISTICS
It is the alternative hypothesis that determines whether a
one- or a two-tailed test should be conducted. A two-tailed
test is used when the alternative hypothesis is
nondirectional, while a one-tailed test is used when the
alternative hypothesis is directional. If a directional
alternative hypothesis predicts that the sample statistic will
be greater than the value specified in the null hypothesis,
the entire rejection region lies in the positive tail of the
sampling distribution. If a directional alternative
hypothesis predicts that the sample statistic will be less
than the value specified in the null hypothesis, the
rejection region is located in the negative tail.
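Assuming a normally distributed (z-based) sampling distribution, the critical values that bound the rejection region for one- and two-tailed tests can be sketched as follows:

    from scipy.stats import norm

    alpha = .05
    # Two-tailed test: alpha is split between both tails
    print(norm.ppf([alpha / 2, 1 - alpha / 2]))   # about -1.96 and +1.96
    # One-tailed test (positive tail): the entire alpha in one tail
    print(norm.ppf(1 - alpha))                    # about +1.645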
INFERENTIAL STATISTICS
Decide, on the basis of the results of the statistical test,
whether to retain or reject the statistical hypotheses.
INFERENTIAL STATISTICS
Decision Outcomes: Regardless of whether an
experimenter decides to retain or reject the null
hypothesis, there are two possible outcomes of
his or her decision: The decision can be either
correct or in error, and an experimenter can never
be entirely certain which type of decision has
been made.
INFERENTIAL STATISTICS
Decision Errors: There are two decision errors, a Type I
error and a Type II error. A Type I error occurs when an
investigator rejects a true null hypothesis. For example, if
the psychologist in Study # 1 concludes that the self-control procedure increases achievement test scores but
the apparent improvement in scores is actually a
consequence of sampling error, the psychologist has
made a Type I error. (Keep in mind that it is unlikely -- but
not impossible -- to obtain a sample value in the rejection
region of the sampling distribution as the result of chance
alone.)
INFERENTIAL STATISTICS
The probability of making a Type I error is equal to alpha. As the value
of alpha increases, the probability of rejecting a true null hypothesis
also increases. Increasing the value of alpha from .01 to .05, for
example, increases the probability of making a Type I error from 1
chance in 100 to 1 chance in 20. Because an investigator sets the
value of alpha, he or she has control over the probability of making a
Type I error. Note that the chance of making a Type I error is affected
by other factors. It may be increased, for example, when the sample
size is small or when observations are dependent. The other decision
error, a Type II error, occurs when an investigator retains a false null
hypothesis. In Study # 1, if the psychologist concludes that the self-control procedure does not improve achievement test scores when it
actually does, the psychologist has made a Type II error.
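The claim that the probability of a Type I error equals alpha can be checked by simulation: when the null hypothesis is true, a test at alpha = .05 should reject it about 5% of the time. A minimal sketch (not from the lecture), assuming normally distributed data:

# Estimate the Type I error rate by repeatedly testing samples drawn from a
# population where H0 (mean = 0) is actually true.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
alpha, n, trials = 0.05, 30, 10_000
rejections = 0

for _ in range(trials):
    sample = rng.normal(loc=0.0, scale=1.0, size=n)  # H0 is true here
    _, p = stats.ttest_1samp(sample, 0.0)
    if p < alpha:
        rejections += 1                              # a Type I error

print(f"Empirical Type I error rate: {rejections / trials:.3f} (alpha = {alpha})")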
INFERENTIAL STATISTICS
The probability of making a Type II error is equal to beta
(β). Although the value of beta is not set by an investigator
and cannot be directly calculated for a particular study, the
probability of making a Type II error can be indirectly
influenced: A Type II error is more likely when the value of
alpha is low, when the sample size is small, and when the
independent variable is not administered in sufficient
intensity.
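Beta can be estimated the same way by drawing samples from a population where the null hypothesis is false. A minimal sketch (not from the lecture); the true effect of 0.4 standard deviations and the sample size are hypothetical:

# Estimate beta (and power) by testing samples from a population whose mean
# really does differ from the value specified by H0.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha, n, trials = 0.05, 25, 10_000
true_mean = 0.4          # H0 claims 0, so H0 is false
misses = 0

for _ in range(trials):
    sample = rng.normal(loc=true_mean, scale=1.0, size=n)
    _, p = stats.ttest_1samp(sample, 0.0)
    if p >= alpha:
        misses += 1      # retained a false H0: a Type II error

beta = misses / trials
print(f"Estimated beta: {beta:.3f}, estimated power: {1 - beta:.3f}")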
INFERENTIAL STATISTICS
There is an inverse relationship between Type I and Type II errors: As
the probability of making a Type I error increases, the probability of
making a Type II error decreases and vice versa. Consequently, the
selection of a level of significance depends, in part, on the seriousness
of making these two errors. For some research, a Type I error
(rejecting a true null hypothesis) is considered more problematic. In
these situations, the experimenter will choose a level of significance
that minimizes the probability of making a Type I error (e.g., .01 rather
than .05). In other situations, it is more important to avoid making a
Type II error (retaining a false null hypothesis). When this is the case,
a larger level of significance is preferred (.10 or .05 rather than .01).
INFERENTIAL STATISTICS
Correct decisions: There are also two possible correct
decisions. An investigator can make a correct decision by
retaining the null hypothesis when it is true or by rejecting
the null hypothesis when it is false.
It is the second type of correct decision that an
experimenter ordinarily wants to make. When a statistical
test enables an experimenter to reject a false null
hypothesis, the test is said to have statistical power.
Obviously, researchers want to maximize power whenever
they conduct a research study.
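The four outcomes can be summarized in a table (added here as a study aid; it is not part of the original slides):

                   H0 is true               H0 is false
Retain H0          Correct decision         Type II error (beta)
Reject H0          Type I error (alpha)     Correct decision (power)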
INFERENTIAL STATISTICS
Methods to Maximize Power
• Increasing alpha: A null hypothesis (true or false) is more likely to be rejected when alpha is .05 than when it is .01.
• Increasing sample size: A correct decision is more likely to be made when the sample size is 50 than when it is 25. The effects of increasing sample size on power are greatest when the sample is small. (When there are 100 or more subjects in each group, adding more subjects does not have a substantial impact on power.)
INFERENTIAL STATISTICS
• Increasing the effect size: Maximizing the effects of the IV increases the likelihood that the effects will be detected. The effects of the IV are maximized by administering the IV for a long enough period of time or in sufficient intensity.
• Minimizing error: When potential sources of systematic and random error are controlled, it is easier to detect the effects of the independent variable. One way to reduce error is to make sure the DV measure is reliable. Another way is to reduce within-group variability by controlling extraneous variables or by using a within-subjects design.
INFERENTIAL STATISTICS
• Using a one-tailed test when appropriate: A one-tailed test is more powerful than a two-tailed test as long as it is appropriately used.
• Using a parametric test: Parametric statistical tests, such as the t-test or ANOVA, are more powerful than nonparametric tests when their assumptions are met. (A numeric sketch of several of these effects follows this list.)
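A minimal numeric sketch of several of these effects (not from the lecture), using the statsmodels power routines; the effect size of d = 0.5 is a hypothetical value:

# How alpha, sample size, and one- vs. two-tailed testing affect the power of
# an independent-samples t-test.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

for alpha in (0.01, 0.05):
    for n in (25, 50, 100):
        power = analysis.solve_power(effect_size=0.5, nobs1=n,
                                     alpha=alpha, alternative='two-sided')
        print(f"alpha={alpha}, n={n} per group, two-tailed: power={power:.2f}")

# A one-tailed test ('larger') is more powerful at the same alpha and n.
power_one = analysis.solve_power(effect_size=0.5, nobs1=25,
                                 alpha=0.05, alternative='larger')
print(f"alpha=0.05, n=25 per group, one-tailed: power={power_one:.2f}")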
INFERENTIAL STATISTICS
Note that "power" is not the same as "confidence." Power
refers to the ability to reject a false null hypothesis and, as
noted above, is affected by the size of alpha: Power
increases as alpha increases and vice versa. Statistical
power is something a researcher is concerned about
before a decision about the null hypothesis is made.
Confidence refers to the certainty a researcher has about
the decision he or she has already made about the null
hypothesis. An experimenter has more confidence that his
or her decision to reject the null hypothesis was correct
when alpha is small (e.g., .01 rather than .05).
INFERENTIAL STATISTICS
Exercise A
When a researcher makes the decision to retain or reject the (1)
________ hypothesis, there is no way to know with certainty if the
decision is correct or in error. There are two kinds of decision errors.
Type I error is made when a true null hypothesis is (2) ________ . This
occurs when a researcher concludes that an independent variable has
had an effect on the dependent variable, but the observed effect was
actually due to (3) ________ . The probability of making a Type I error
is equal to (4) ________ . For example, when alpha is set at .05 and the researcher has rejected the null hypothesis, there is a (5) ________ % chance that a Type I error has been made.
INFERENTIAL STATISTICS
A Type II error is made when a false null hypothesis is (6) ________ . This occurs when the researcher decides that an independent variable has no effect on the dependent variable when it actually does. A Type II error might occur when the (7) ________ variable was not administered in sufficient intensity or for a long enough period of time, when the sample size was too (8) ________ , or when alpha is too (9) ________ .
INFERENTIAL STATISTICS
A researcher can, of course, make a correct decision. One kind of correct decision is to (10) ________ a true null hypothesis. In this situation, the researcher correctly concludes that any observed effect of an IV is actually due to (11) ________ . The other correct decision is to (12) ________ a false null hypothesis. The researcher correctly concludes that the (13) ________ has an effect on the DV. When a statistical test enables a researcher to make this kind of correct decision, the test is said to have power.
INFERENTIAL STATISTICS
Power is increased as alpha (14) ________ , as the sample size (15) ________ , and as the magnitude of effect of the (16) ________ increases. Power is also maximized when a (17) ________ -tailed test is used (if appropriate) and when the data are analyzed using a t-test, ANOVA, or other (18) ________ statistical test.
INFERENTIAL STATISTICS
Exercise B
A researcher compared the number of cavities of children who had
used either Toothpaste brand X or Toothpaste brand Y for a year. At
the end of the year, the researcher found that the children who had
used brand X had significantly fewer cavities than the children who
had used brand Y. The difference was significant at the .05 level.
1. What is the null hypothesis?
2. What is the research hypothesis?
3. What would be the Type I error?
4. What would be the Type II error?
5. What is the probability of a Type I error?
Non-Experimental Research
Non-experimental Research
Non-experimental (descriptive) research is conducted
primarily to collect data about variables rather than to test
hypotheses about the relationships between them. In other
words, a non-experimental study is conducted to describe
"how things are." Observational studies, archival research,
correlational research, case histories and case studies,
and surveys are ordinarily non-experimental.
Non-Experimental Research
Non-experimental Research
• Observational studies
• Archival research
• Correlational research
• Case histories and case studies
• Surveys
Non-Experimental Research
Observational Studies
Observational studies involve observing behavior in a
systematic way, often in a naturalistic setting. Naturalistic
field studies and participant observation are examples of
observational studies. An important decision that must be
made before conducting an observational study is how to
record the behavior of interest. One method is to obtain a
narrative record of the behavior as it actually occurred,
with the record taking the form of a detailed written
description or an audio and/or visual recording.
Non-Experimental Research
• Content analysis, which involves organizing the data into categories, can then be used to summarize and interpret the information contained in the narrative record.
• Protocol analysis can be viewed as a type of content analysis. It is used by psychologists interested in the cognitive processes ("heeded cognitions") that underlie problem-solving and other complex tasks and involves asking a subject to "think aloud" while solving a problem. The subject's verbalizations are recorded, and the protocol (record) is later coded in terms of relevant categories such as intentions, cognitions, planning, and evaluations. To obtain valid information when conducting a content analysis, the coding (behavioral) categories must be clearly defined, exhaustive, and mutually exclusive. (A small coding-check sketch follows this list.)
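The exhaustive and mutually exclusive requirements can be checked programmatically once a protocol has been coded. A minimal sketch (not from the lecture); the category labels and segments are hypothetical:

# Flag coded segments that violate the coding rules: every segment should get
# exactly one label, and every label should belong to the category scheme.
CATEGORIES = {"intention", "cognition", "planning", "evaluation"}

coded_segments = [
    ("I'll try the bigger disk first", {"intention"}),
    ("that move blocks the goal", {"cognition", "evaluation"}),  # not exclusive
    ("hmm", set()),                                              # not exhaustive
]

for text, labels in coded_segments:
    if not labels:
        print(f"Not exhaustive: no category fits {text!r}")
    elif len(labels) > 1:
        print(f"Not mutually exclusive: {text!r} got {sorted(labels)}")
    elif not labels <= CATEGORIES:
        print(f"Unknown label on {text!r}: {sorted(labels - CATEGORIES)}")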
Non-Experimental Research
An alternative to obtaining a complete record of a behavior
is to look at specific aspects of it by employing a
systematic method for sampling and recording the
frequency or duration of the behavior and/or rating the
behavior in terms of its qualitative characteristics. Methods
of behavioral sampling include interval recording and
event sampling.
Non-Experimental Research
• Interval recording, a type of time sampling, involves observing a behavior for a period of time that has been divided into equal intervals (e.g., a 30-minute period that has been divided into 15-second intervals) and recording whether or not the behavior occurred during each interval. Interval recording is especially useful for studying complex interactions and behaviors that have no clear beginning or end such as laughing, talking, or playing.
• Event sampling (recording) entails observing a behavior each time it occurs. This technique is good for studying behaviors that occur infrequently, that have a long duration, or that leave a permanent record or other product (e.g., a completed worksheet or test). (A short scoring sketch for interval recording follows this list.)
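A minimal scoring sketch for interval-recording data (not from the lecture; all values hypothetical):

# Summarize an interval-recording session: each entry records whether the
# target behavior occurred during one 15-second interval.
intervals = [True, False, True, True, False, False, True, False]  # first 8 of 120

occurred = sum(intervals)
print(f"Behavior observed in {occurred} of {len(intervals)} intervals "
      f"({100 * occurred / len(intervals):.0f}%)")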
Non-Experimental Research
• Situational sampling is an alternative to behavioral sampling and is used when the goal of the study is to observe a behavior in a number of settings. Situational sampling helps increase the generalizability of a study's findings.
• Sequential analysis entails coding behavioral sequences rather than isolated behavioral events and is used to study complex social behaviors (Bakeman and Gottman, 1989). (A transition-count sketch follows below.)
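As an illustration of what coding behavioral sequences rather than isolated events yields, here is a minimal sketch (not from the lecture) that tallies transitions between coded behaviors; the sequence is hypothetical:

# Count behavior-to-behavior transitions, the raw material of a sequential
# analysis.
from collections import Counter

sequence = ["ask", "answer", "ask", "ignore", "ask", "answer", "answer"]

transitions = Counter(zip(sequence, sequence[1:]))
for (prev, nxt), count in sorted(transitions.items()):
    print(f"{prev} -> {nxt}: {count}")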
Non-Experimental Research
Case Studies
Although a case study is most often associated with an in-depth description and
analysis of a single person, it can also entail an intensive investigation of
a single institution, agency, community, or other social unit. Ethnographic
research, which focuses on a single culture, is an example of a case
study that includes more than one person. Two shortcomings of case
studies are that (a) their results usually do not allow an investigator to
draw conclusions about the exact nature of the relationships between
variables (e.g., to determine if status on one variable causes status on
another variable) and (b) the information derived from a case study might
not be generalizable to other cases. Because of their limitations, case
studies are most useful for investigating rare phenomena and as
exploratory studies for identifying independent and dependent variables
and generating hypotheses about the relationships between them that
can be more systematically investigated in the future.
Non-Experimental Research
Surveys
A survey involves administering a questionnaire either in person, by
phone, or through the mail. A serious problem with mail surveys is
their susceptibility to nonresponse biases that occur when the people
who fail to send back the survey differ in important ways from those
who return it. In general, the lower the overall response rate, the more
likely the survey's results will be affected by nonresponse biases.
Several techniques are useful for increasing the number of responses
to mail surveys and for reducing biases. Probably the most effective
method is to follow up the initial contact with one or more mailings
(three follow-ups seem optimal). Other useful methods are including a
cover letter that provides relevant information (e.g., information about
the sponsoring agency and the purpose of the study); including a
small reward (larger rewards have not been found to be more effective
than smaller ones); and pre-contacting individuals by phone about the
questionnaire.