These are the questions for the Masters Comps and Doctoral Prelims for Fall 2011.
Please write your answers in the blue books. Mark each item number and section
clearly. Put your name on the outside of each blue book. Do NOT write your name
inside the books. Try to write neatly, please.
The proctor will pick up the booklets when you are done or when your time is over.
Turn off your phone. If you need a calculator, one will be provided for you.
Best of luck,
Drs. Lissitz, Jiao, and Stapleton
Statistics
1. This question is related to power and hypothesis testing.
a. Explain the following terms in hypothesis testing: Type I error rate, Type II error
rate, and power.
RUBRIC: Type I error rate – probability of rejecting the null hypothesis when the
null hypothesis is true.
Type II error rate – probability of failing to reject the null hypothesis when the
null hypothesis is false.
Power – the probability of rejecting the null hypothesis when the null hypothesis
is false.
The Type II error rate and power sum to 1.0.
b. Use a graph to represent the relations among α, β, and power in each of the
following two hypothesis tests: a two-tailed test, and a one-tailed test with the
alternative distribution falling at the upper end of the null distribution.
RUBRIC: Each graph should show a null and an alternative sampling distribution, with
the alternative distribution to the right of the null (at least for part #2). Correct
lines/shading should identify two equal-sized areas for α under the null distribution
(for part #1) and one area for α (equal to the combination of those two areas) for
part #2. The student should use two graphs, one for part #1 and one for part #2,
for clarity. Each region should be correctly labeled (power, Type II error, and Type
I error).
c. Specify the factors affecting power in hypothesis testing and state each
factor's impact on power (indicate the direction of change in each factor).
RUBRIC:
Should specify (a numeric sketch follows this list):
 Sample size – the larger the sample, the smaller the variability in the
sampling distributions and the greater the power.
 Effect size – the greater the effect size, the greater the separation of
the sampling distributions and the greater the power.
 Tails of the hypothesis test – a one-tailed test has greater power (as seen in the
graph in 1b).
 Type I error rate – the greater the rate, the greater the power.
 Variance of the sampling distribution – the greater the variance of the sampling
distribution, the smaller the power.
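A minimal numeric sketch (not part of the exam; a one-sample z-test with known sigma is assumed, and the function name and values are illustrative only) showing how each factor in the list above moves power:

```python
# Power of a one-sample z-test as a function of effect size, n, alpha, and tails.
from scipy.stats import norm

def z_test_power(effect_size, n, alpha=0.05, two_tailed=True):
    """Power for a standardized effect (Cohen's d) with known sigma."""
    shift = effect_size * n ** 0.5            # mean shift in standard-error units
    if two_tailed:
        crit = norm.ppf(1 - alpha / 2)
        # reject if |Z| > crit; under the alternative the upper tail dominates
        return norm.sf(crit - shift) + norm.cdf(-crit - shift)
    crit = norm.ppf(1 - alpha)                # one-tailed, upper
    return norm.sf(crit - shift)

print(z_test_power(0.5, 25))                    # ~0.70 baseline
print(z_test_power(0.5, 50))                    # ~0.94 larger sample -> more power
print(z_test_power(0.8, 25))                    # ~0.98 larger effect -> more power
print(z_test_power(0.5, 25, two_tailed=False))  # ~0.80 one tail -> more power
print(z_test_power(0.5, 25, alpha=0.10))        # ~0.80 larger alpha -> more power
```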
2. The county school board has instituted an incentive program to encourage middle
schools to improve their mathematics instruction. If a school’s 7th grade students score
on average above the 50th percentile on a standardized exam, then the school is awarded
$20,000 for equipment purchases. The standardized exam has a normed mean of 50 with
a standard deviation of 10.
The Park middle school participated in this program last year and the school board
randomly selected ten of the school’s sixty students to take the exam. On average, the
students earned a score of 48.2 and because this average was lower than 50, the school
did not receive the $20,000.
Statistically speaking, do you think it was appropriate that the Park middle school did not
receive the $20,000? Why or why not?
RUBRIC: The student should indicate that the sample mean is not necessarily reflective
of the population mean, discussing the concept of sampling error. An excellent answer
would conduct an actual z-test (z = −1.8/3.16 = −0.57, p > .05) or create a confidence
interval, but an answer could pass without doing the actual test(s). There is no
evidence that the population student mean is not at least 50.
Note that the question began with "statistically speaking," so there is no reason to get
into issues of validity. However, a student might discuss the low-stakes nature of the
assessment for the student and question the validity of the scores on this assessment
(but this would be beyond the grading of this question).
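A minimal sketch (not part of the exam) of the z-test and confidence interval the rubric mentions, using the numbers given in the question:

```python
# One-sample z-test: n = 10, sample mean 48.2, normed mean 50, SD 10.
from scipy.stats import norm

n, xbar, mu, sigma = 10, 48.2, 50, 10
se = sigma / n ** 0.5            # 10 / sqrt(10) = 3.16
z = (xbar - mu) / se             # -1.8 / 3.16 = -0.57
p = 2 * norm.cdf(-abs(z))        # two-tailed p = 0.57, well above .05
print(z, p)

# 95% confidence interval for the school's mean: 48.2 +/- 1.96 * 3.16,
# roughly (42.0, 54.4) -- it comfortably contains 50.
lo, hi = xbar - 1.96 * se, xbar + 1.96 * se
print(lo, hi)
```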
3. A researcher is interested in investigating the effect of genres of writing prompts on
students’ writing achievement. Three genres were included in the study: narrative,
descriptive, and persuasive. Each student was required to complete all three prompts. To
reduce the ordering effect of prompts, the researcher assigned a representative sample of
the student population randomly to each of the possible orderings of the three writing
prompts. Further, the researcher believed that gender may interact with the genre of
writing prompts and so included it in the design of the study.
a. Suggest a proper analysis of variance (ANOVA) method for data analyses for this
research study. Be sure to reflect the nature of the analysis components.
RUBRIC: A good answer would name the design: a split-plot design.
The student should indicate that there would be one repeated (within-subjects) factor
and that gender would be treated as a between-subjects factor. Given the
counterbalancing, a between-subjects ordering factor is not necessary to include
(though it is not problematic to include it as well).
b. What are the assumptions for the identified data analysis model? Explain each
assumption and suggest a method for checking potential violation of each
assumption.
RUBRIC:
Assumptions:
 Homogeneity of variance – the variability of the scores is the same in the
population across each of the genders. Levene's test or another such test
would be an appropriate assessment of the assumption; alternately, a suggestion to
examine plots of the data would be appropriate.
 Normality of the sampling distribution – the distribution of the scores in the
population is normal within each cell of the design. Check skew, kurtosis, the
Kolmogorov–Smirnov or Shapiro–Wilk tests, or rely on the sample sizes (CLT) at
each measurement occasion.
 Sphericity – the variance of the difference scores between each pair of repeated
measures is the same. Mauchly's test can be used; if the assumption is violated,
use an alternate df/test.
 Independence of observations – conditional on the design, scores are not
dependent. Check the research design description. (No information was
provided, however.)
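A minimal sketch (toy data and illustrative variable names, not from the exam) of the assumption checks suggested above; Mauchly's test is not in scipy, so it is only noted in a comment:

```python
# Assumption checks for the split-plot design, using tests available in scipy.
import numpy as np
from scipy.stats import levene, shapiro

rng = np.random.default_rng(0)
# toy data: writing scores for one genre, split by gender
scores_male = rng.normal(50, 10, 40)
scores_female = rng.normal(52, 10, 40)

# Homogeneity of variance across groups (Levene's test)
print(levene(scores_male, scores_female))

# Normality within each cell of the design (Shapiro-Wilk)
print(shapiro(scores_male))
print(shapiro(scores_female))

# Sphericity: Mauchly's test is not in scipy; a dedicated ANOVA package
# (e.g., pingouin) or an epsilon correction such as Greenhouse-Geisser is
# the usual route when the assumption is in doubt.
```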
4. The standard error of the jth regression coefficient in a multiple regression model can
be computed as
SE(\hat{\beta}_j) = \frac{S_Y}{S_{X_j}} \sqrt{\frac{1 - R^2}{(N - p - 1)(1 - R_j^2)}}
where R^2 is the proportion of variance accounted for by the entire model, S_Y and
S_{X_j} are the standard deviations of Y and the jth predictor X_j, and R_j^2 is the
proportion of variance in the jth predictor accounted for by the other predictors in the
model.
Using this formula, discuss in detail what affects the power of the t-test to detect a
significant finding when testing the jth regression coefficient.
RUBRIC:
N – the larger the sample size (N), the greater the power, all other things constant.
p – the fewer the number of other predictors, the greater the power, all other things
constant.
R_j^2 – we want the jth variable to share as little variance with the other predictors
as possible, so R_j^2 should be small (as close to zero as possible) to achieve the
greatest power, all other things constant; this measure represents collinearity.
R^2 – we want maximal variance explained in the outcome variable, so the closer R^2 is
to 1, the higher the power, all other things constant.
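A minimal numeric sketch (toy values, not from the exam) of how each quantity in the formula moves the standard error, and hence the t-statistic and power:

```python
# SE of the jth regression coefficient per the formula above.
def se_beta_j(s_y, s_xj, r2, r2_j, n, p):
    return (s_y / s_xj) * ((1 - r2) / ((n - p - 1) * (1 - r2_j))) ** 0.5

base = se_beta_j(s_y=10, s_xj=5, r2=0.40, r2_j=0.20, n=100, p=3)
print(base)                                        # baseline SE
print(se_beta_j(10, 5, 0.40, 0.20, n=400, p=3))    # larger N -> smaller SE
print(se_beta_j(10, 5, 0.40, 0.80, n=100, p=3))    # collinear X_j -> larger SE
print(se_beta_j(10, 5, 0.80, 0.20, n=100, p=3))    # higher R^2 -> smaller SE
# Since t = b_j / SE(b_j), a smaller SE means a larger t and greater power.
```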
Research design
1. A researcher wants to examine the relation between amount of time spent on
homework and achievement in algebra. He is given two possible study designs to
consider.
Study A: 30 randomly-selected students enrolled in algebra classes in School A
would be assigned to do 30 minutes of homework per day and 30 randomly-selected students in algebra classes in School B would be assigned to do 2 hours
of homework per day. At the end of one month, the 60 students would be given a
standardized algebra exam. The average performance on the exam across the two
conditions would be compared using a t-test to determine the relation of time
spent on homework and algebra performance.
Study B: 60 randomly-selected algebra students across Schools A and B would be
surveyed and asked how much algebra homework they do on average per day.
Additional data about the students would be obtained on the survey. The 60
students would also be given a standardized algebra exam. To answer the research
question, the performance on the exam would be regressed on time on homework,
after controlling for other student characteristics believed to influence performance.
Recommend the study design that the researcher should use. In your recommendation, be
sure to discuss the advantages and disadvantages of each of the designs, specifically
addressing threats to internal validity. You may also want to suggest additional analyses
that the researcher should undertake to evaluate any assumptions that are being made in
selecting the study design.
RUBRIC: Selection of either study is fine.
Study A – the biggest threat to internal validity is likely differential selection
(a disadvantage). The advantage is that manipulation of the independent variable did
occur. We would want to evaluate the equivalence of student ability in each school
prior to the study and examine the equivalence of instruction, possibly including any
pre-existing differences in an ANCOVA model. There is also a possible external validity
threat, given that students may not generally study 30 minutes vs. 2 hours. (Study B
might capture more natural values of study time.)
Study B – this is a correlational design, so it has little internal validity; a third
variable might explain any relation between the two. It may be more palatable to the
school administration than forcing study time on students. Additionally, will students
be truthful in their responses regarding homework time? The design specifically includes
consideration of covariates, but there might be a treatment-by-selection effect.
2. You have been consulted to design a comparability study to investigate the equivalence
between a paper-and-pencil test and its computer-based version for a large-scale state
math test. The researcher is interested in using either a matched pair design or a repeated
measures design.
a. Describe how to implement the study using each design. For the matched pair
design, be sure to include two matching variables which you think are relevant.
RUBRIC: For matched pair, student should identify how to select participants and how to
do the matching. Students should be randomly assigned within matched pairs to each of
the conditions. Student should explain the variables used for matching (any variables
suggested are reasonable here).
For repeated measures, the student should explain the ordering (or counterbalancing, if
used) of the tests, how participants are selected, and how the tests are administered
(including the time allowed between administrations).
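A minimal sketch (hypothetical matching variables and a crude sort-based pairing, for illustration only) of random assignment within matched pairs for the comparability study:

```python
# Match students on relevant variables, then randomize within each pair.
import random

# each record: (student_id, prior_math_score, computer_familiarity)
students = [(i, random.gauss(50, 10), random.random()) for i in range(20)]
students.sort(key=lambda s: (s[1], s[2]))      # order on matching variables

assignments = {}
for a, b in zip(students[::2], students[1::2]):    # adjacent students form a pair
    paper, computer = random.sample([a, b], 2)     # random assignment within pair
    assignments[paper[0]] = "paper-and-pencil"
    assignments[computer[0]] = "computer-based"
print(assignments)
```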
b. Discuss the pros and cons of each design. Make sure to refer to the potential
threats to internal validity and external validity.
RUBRIC: For repeated measures, clearly state that testing effects may be problematic;
counterbalancing might solve the problem.
 Each subject serves as his/her own control.
 Subjects are exposed to every level of the treatment.
 Fewer subjects are needed.
 Threats to validity: carryover effects and sample attrition.
 Counterbalancing helps, but it introduces fatigue and is time-consuming.
The matched pair design requires a greater sample size and depends on the strength of
the matching variables for equivalence. Random assignment within pairs would alleviate
any differential selection issues.
 After matching on one or more variables, subjects are randomly assigned within the
matched group to treatment levels.
 Subjects are exposed to only one level of the treatment.
 Threats of fatigue are reduced.
c. Recommend possible actions the researcher could take to reduce the threats.
RUBRIC: Use a between-subjects design.
Use more matching variables.
Assessment
1. Assume that you are running a carefully constructed assessment program that includes
a one-hour session with 50 test questions administered from a carefully constructed item
bank of a thousand questions. Suppose, further, that you were required to decrease the
testing time to 45 minutes and therefore need to decrease the number of test items. Describe
three considerations that you would worry about. What are they and why would they
potentially be a problem? Is there anything you can do to minimize the problems
associated with each?
RUBRIC: Concerns should include:
1) Content coverage
2) Difficulty level coverage
3) Loss of reliability in the total score from the assessment
To address minimizing these problems, work with a test blueprint for items 1 and 2.
For 3, use CAT, or at least select those items with the highest discrimination and/or
information across the range of theta that is to be measured.
The Spearman-Brown equation could be referenced, along with how it might be used and the
assumptions for its use (see the sketch below). To maintain test design fidelity,
changing the difficulty can be an issue; care is needed in matching difficulty to the
ability of the test takers, or in deciding where on the scale the measurement should
focus. There is also the issue of how to actually select the items to eliminate.
Reliability
Inter-item correlation
Purpose of the test
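A minimal worked example (the original reliability of 0.90 is an assumed value) of the Spearman-Brown prophecy formula mentioned above, applied to shortening the 50-item test to roughly 45/60 of its length (about 38 items):

```python
# Spearman-Brown prophecy: predicted reliability after changing test length.
def spearman_brown(rho, k):
    """Predicted reliability when test length is multiplied by factor k."""
    return k * rho / (1 + (k - 1) * rho)

rho_original = 0.90        # assumed reliability of the 50-item test
k = 38 / 50                # shortened to ~38 items (45/60 of the time)
print(spearman_brown(rho_original, k))   # ~0.87: a modest loss of reliability
```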
2. Test item analysis utilizes a number of classical statistical calculations. Suppose
you had a set of multiple-choice data from a test, and each item was scored correct (1
point) or incorrect (0 points). The total test score is measured both as the number
correct and as a pass (1) or fail (0) score. The following are four statistics that are
often used to analyze such items to decide whether item behavior is good or bad from the
standpoint of developing a test with high reliability:
Phi
Tetrachoric
Point Biserial
Biserial
Answer each of the questions:
a. Define each statistic
RUBRIC: Phi – for a 2x2 table; equal to a Pearson correlation computed on the two
dichotomies.
Tetrachoric – for a 2x2 table, with the assumption of underlying continuous, normally
distributed variables.
Point biserial – for a dichotomy by a continuous variable; equal to a Pearson
correlation.
Biserial – for a dichotomy by a continuous variable, but the dichotomous variable is
assumed to come from a normally distributed continuous variable.
b. What are the pros and cons of each of the statistics? Include any concerns with
the accuracy of each statistic as a measure of association.
RUBRIC: There is the issue of differential proportions decreasing the correlation,
unless you invoke the normality assumption and an underlying continuous dimension;
those statistics are more complex and harder to describe to a lay audience. The
tetrachoric and biserial are greater than or equal to their related pair. The student
might talk about dichotomous data issues. The assumptions are different.
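A minimal sketch (toy data) of the two statistics computable directly with scipy; tetrachoric and biserial correlations require specialized routines (e.g., in R's psych package) and are only noted in a comment:

```python
# Phi and point-biserial correlations for dichotomous item-analysis data.
import numpy as np
from scipy.stats import pearsonr, pointbiserialr

item = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 0])      # item score (0/1)
passed = np.array([1, 0, 1, 0, 0, 1, 0, 1, 1, 1])    # pass/fail (0/1)
total = np.array([42, 18, 35, 27, 20, 40, 15, 38, 44, 30])  # number correct

print(pearsonr(item, passed))        # phi: Pearson r on two dichotomies
print(pointbiserialr(item, total))   # point-biserial: dichotomy vs. total score
# Tetrachoric and biserial assume an underlying normal variable behind each
# dichotomy; scipy has no built-in routine for them.
```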
Measurement
1. Properly understanding item properties is vital in test form construction. Using the two
graphs below, answer the following questions.
a. Identify the most difficult item and the most discriminating item, assuming the
items fit a two-parameter unidimensional item response theory model.
RUBRIC: Most difficult = Item 2.
Most discriminating = Item 3.
b. If two test forms are to be constructed to classify examinees with ability around 2
and 0.5 on the logit scale respectively, which two items should be included on
each form? State your reasons.
RUBRIC: For the form targeting ability 2, Items 2 (blue) and 1 (pink). For the form
targeting ability 0.5, Items 1 and 3. These pairs carry the most information at those
two locations.
[Figure: item characteristic curves for Items 1-3; probability of a correct response (0.0-1.0) plotted against theta (-4.0 to 4.0).]
Figure 1. Item characteristic curves.
[Figure: item information curves for Items 1-3; item information plotted against theta (-4.0 to 4.0).]
Figure 2. Item information curves.
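A minimal sketch (made-up item parameters, not the items in the figures) of the 2PL quantities behind Figures 1 and 2: the item characteristic curve and the item information a^2·P·(1−P):

```python
# 2PL item characteristic curve and item information function.
import numpy as np

def icc_2pl(theta, a, b):
    """Probability of a correct response under the 2PL model."""
    return 1 / (1 + np.exp(-a * (theta - b)))

def info_2pl(theta, a, b):
    """2PL item information: a^2 * P * (1 - P)."""
    p = icc_2pl(theta, a, b)
    return a ** 2 * p * (1 - p)

theta = np.linspace(-4, 4, 9)
a, b = 1.5, 0.5                      # hypothetical item parameters
print(icc_2pl(theta, a, b))          # steeper a -> more discriminating item
print(info_2pl(theta, a, b))         # information peaks at theta = b
```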
2. In 3-parameter IRT item analysis, a number of statistics are computed to examine the
quality of items. The following are 5 useful statistics:
The a parameter
The b parameter
The c parameter
Conditional Item information
Conditional Item error (standard error of measurement)
a. Define each statistic
RUBRIC: The a parameter: the discrimination parameter, indicating how well an item
distinguishes between high- and low-ability students. It is proportional to the slope
at the steepest point on the item characteristic curve.
The b parameter: item difficulty, indicating the point on the ability scale at which
the probability of answering the item correctly is (1+c)/2. In the 1PL and 2PL IRT
models, item difficulty is the point on the ability scale where examinees have a 50%
probability of getting the item correct.
The c parameter: the guessing parameter, the lower asymptote of the item characteristic
curve, indicating the probability that low-ability students get the item correct.
Conditional item information: indicates the measurement precision at a given ability
level. It depends on the probabilities of a correct and an incorrect response and on
the first derivative of the probability of a correct response with respect to theta,
and it is determined by all item parameters in the particular IRT model, conditional on
theta.
Conditional item error (standard error of measurement): an index of measurement
precision/imprecision. It is the square root of the inverse of the conditional item
information, conditional on theta.
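A minimal sketch (hypothetical parameter values) tying these statistics together for the 3PL model: the ICC, conditional item information, and the conditional standard error of measurement:

```python
# 3PL probability, item information, and conditional SEM.
import numpy as np

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response: c + (1-c)/(1+exp(-a(theta-b)))."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

def info_3pl(theta, a, b, c):
    """3PL item information: a^2 * (Q/P) * ((P-c)/(1-c))^2."""
    p = p_3pl(theta, a, b, c)
    return a ** 2 * ((1 - p) / p) * ((p - c) / (1 - c)) ** 2

a, b, c = 1.2, 0.0, 0.2
theta = np.array([-2.0, 0.0, 2.0])
info = info_3pl(theta, a, b, c)
sem = 1 / np.sqrt(info)              # conditional SEM = 1/sqrt(information)
print(p_3pl(theta, a, b, c))         # at theta = b, probability = (1+c)/2 = 0.6
print(info)                          # information is highest near b
print(sem)                           # SEM is lowest where information is highest
```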
b. Explain how (in what context) each statistic is useful or not useful. Talk about
the usefulness of each: a reflects the quality of the item; b indicates where on the
ability scale the item is most likely useful; c gives us an idea of how useless the
item is for low-ability examinees; item information is useful for planning and for
making decisions. Information and SE are related – explain how.
c. If you were interested in designing a test so that it is most useful for a certain
ability level of student, which one(s) of these statistics would you pay closest
attention to, and why? Information, the item error statistic, or the b parameter; pick
one and defend its use. Talk about targeting the design of a test.
3. Validity is often classified into three types – content, criterion and construct. Suppose
you were going to use a test to select workers in a company and six months later you
were going to measure their success on the job using a criterion measure. You plan to
compare each employee’s initial test results to his or her criterion measure. What
concerns might you have with this study? Be sure to talk about issues that would impact
criterion validity in this context. How do you think this approach to validating the
test for selection compares to content validity? Talk about the differences between these
two approaches to validating a selection test in a work environment.
RUBRIC: The existence of a criterion and the utilitarian nature of the study make it a
natural criterion validity study. Concerns include the quality of the criterion, the
degree to which it is really an indication of the success of an employee, and how much
learning occurs on the job, such that the entry exam becomes less relevant over time.
Scale truncation and restriction-of-range issues also apply, since only those selected
by the test are observed on the criterion.
4. An evaluation team has created a scale to assess the satisfaction of trainees in a
workshop. The following scale was created: 1 means the workshop was much too hard,
2 means it was too hard, 3 means it was just right, 4 means it was too easy, and
5 means it was much too easy. Discuss this scale design. Does it pose any
issues for using the results in a correlation study that the researchers were planning to
conduct? If you were asked to improve the design of the scale, what would you suggest
and why? If you like the scale the way it is, tell us why you like it. What level of
measurement scale is it and why (nominal, ordinal, interval or ratio)? What impact does
the scale of measurement have upon data analysis? Be specific about the importance of
the scale upon an analysis.
RUBRIC: Note that the scale is not monotonic with respect to satisfaction: 3 is the
best rating and both 1 and 5 are worse, so you can't use the scale as-is for most
statistical work. Suggest a fix, such as using the distance from 3 as the
transformation (like an unfolding scale). The transformed scale is more ordinal than
interval, although you might treat it as interval for practical purposes.
It is not good for correlation analysis using the Pearson product-moment correlation,
which requires at least interval-scaled variables.
One may use multiple items measuring the different facets of trainees' satisfaction to
get a sum score, which is approximately continuous.
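A minimal sketch (one possible recoding, not the only fix) of the distance-from-the-midpoint transformation suggested above, folding the 1-5 scale so larger values mean a better-fitting workshop:

```python
# Fold the "too hard / just right / too easy" scale around its midpoint.
responses = [1, 2, 3, 4, 5]
fit = [2 - abs(x - 3) for x in responses]   # 3 -> 2 (best); 1 and 5 -> 0 (worst)
print(fit)                                  # [0, 1, 2, 1, 0]
```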