Writing, Selecting, and Administering Tests
Psychology of Testing & Measurements
Lecture, Chapters 6 & 7
Item Writing
The nature of a test, its objectives and purposes, dictates what types of
questions may be constructed.
Variables of measurement interest must be clearly defined.
Items must be clear and concise, at the appropriate level for the population,
and free of bias.
Item Formats
Dichotomous - two alternatives (forced choice)
Polytomous - more than two alternatives
Likert - rating scale (strongly agree, etc.)
Category - on a scale of one to ten
Checklists and Q-sorts - choose best fit from a long list of
adjectives
Dichotomous Item Format
Advantages
Simple to answer, administer, and score
Requires absolute judgment
Disadvantages
Oversimplification
Memorization without comprehension
50% chance of guessing correctly
Polytomous Item Format
Importance of good distractors
Issue of guessing
Corrected score: C = R - w/(n - 1) = 27 - 3/(4 - 1) = 27 - 1 = 26
R = # right responses
w = # wrong responses
n = # choices for each item
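A minimal Python sketch of this correction for guessing, using the slide's worked example (the function and argument names are illustrative):

```python
def corrected_score(right, wrong, n_choices):
    """Correction for guessing: a blind guesser produces roughly one
    right answer for every n - 1 wrong ones, so subtract that share."""
    return right - wrong / (n_choices - 1)

# Worked example from above: 27 right, 3 wrong, 4 choices per item.
print(corrected_score(27, 3, 4))   # 26.0
```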
Advantages
Easy to administer and score
Chance of guessing correctly reduced to 20-25% (with four or five alternatives)
Takes less time and can cover large amounts of material
Likert and Category Formats
Used as part of Likert’s (1932) method of attitude scale construction; most
popular format in current measures
Consists of several alternative choices on a continuum for participants to rate
themselves on attitude or personality
Number of choices can permit or prevent neutrality (an odd number offers a neutral midpoint; an even number forces a leaning)
Category format: Increases # choices to 9 or 10 (beyond that may reduce
reliability)
Effect of Context
The numbers we assign are found to be affected by context (Parducci, 1968).
There is a tendency to spread responses evenly across 10 categories.
How immoral are these acts? Students rated List 1 (mild actions) and List 2
(severe actions); the pairs below show each item’s mean rating in List 1 vs. List 2.
Bawling out servants publicly (2.64 vs. 2.39)
Poisoning a neighbor’s dog whose barking bothers you (4.19 vs. 3.65)
Pocketing the tip the previous customer left the waitress (3.32 vs. 2.46)
Publishing under your own name an investigation originated and carried out without remuneration by a graduate student working under you (3.95 vs. 3.47)
Failing to put back in the water lobsters shorter than the legal limit (2.22 vs. 1.82)
Habitually borrowing small sums of money from friends and failing to return them (2.93 vs. 2.37)
Implications: clearly define the endpoints of the scale; use extreme caution
when comparing responses obtained outside the current study.
Item Analysis
Item analysis = evaluating individual test items through assessments of
difficulty and discriminability
Item difficulty - proportion of participants who answer a particular item
correctly (40% answering correctly = .40 difficulty; higher values actually
indicate easiness rather than difficulty).
Recommended “difficulty” is halfway between the level of success by chance
alone and 100% responding correctly (see the sketch below).
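A minimal Python sketch of the difficulty index and the recommended optimal value (the response data and the 4-choice item are hypothetical):

```python
# Hypothetical 0/1 responses to one item: 1 = correct, 0 = incorrect.
responses = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

# Difficulty index: proportion answering correctly (.60 here).
difficulty = sum(responses) / len(responses)

# Recommended "optimal" difficulty: halfway between chance success
# and 100% correct.  For a 4-choice item, chance = .25.
chance = 1 / 4
optimal = (chance + 1.0) / 2   # .625

print(f"difficulty = {difficulty:.2f}, optimal = {optimal:.3f}")
```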
Item discriminability – assessment of whether the participants who have done
well on particular items have also done well on the whole test
Extreme group method – compares item performance of those who did well on the whole test with those who did poorly
Point-biserial method – correlates performance on an individual item with the overall test score
A good test item will discriminate at all levels; # of participants who answer
the item correctly increases as test score increases.
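A minimal Python sketch of both approaches (the eight examinees' data are hypothetical; the 27% extreme-group convention and the point-biserial formula are standard, though details vary across texts):

```python
def extreme_group_d(item, totals, fraction=0.27):
    """Extreme-group index: proportion passing the item among the top
    scorers minus the proportion passing among the bottom scorers."""
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = max(1, int(len(totals) * fraction))
    def prop(group):
        return sum(item[i] for i in group) / len(group)
    return prop(order[-k:]) - prop(order[:k])

def point_biserial(item, totals):
    """Point-biserial correlation between a 0/1 item and total scores."""
    n = len(item)
    p = sum(item) / n                       # proportion passing the item
    mean_all = sum(totals) / n
    sd = (sum((t - mean_all) ** 2 for t in totals) / n) ** 0.5
    mean_pass = sum(t for x, t in zip(item, totals) if x) / sum(item)
    return (mean_pass - mean_all) / sd * (p / (1 - p)) ** 0.5

item   = [1, 1, 1, 0, 1, 0, 0, 0]      # responses to one item
totals = [9, 8, 8, 7, 6, 4, 3, 2]      # total test scores
print(extreme_group_d(item, totals))   # 1.0: strong discrimination here
print(point_biserial(item, totals))    # about .77
```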
Items for Criterion-Referenced Tests
Item development is based on learning goals and program objectives, not the
performance of peers.
Compare those who have participated in program with those who have not and
identify a cutting score.
Use the cutting score as the minimum criterion for meeting objectives.
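A minimal sketch of identifying a cutting score; the scores are hypothetical, and scanning for the threshold that best separates the two groups is one simple criterion for approximating the point of least overlap between their distributions:

```python
def cutting_score(program, comparison):
    """Scan candidate cutoffs and keep the one that best separates
    program participants (at/above the cut) from non-participants."""
    best_cut, best_acc = None, -1.0
    for cut in sorted(set(program + comparison)):
        hits = sum(s >= cut for s in program) + sum(s < cut for s in comparison)
        acc = hits / (len(program) + len(comparison))
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut

program    = [14, 16, 17, 18, 19, 20]   # completed the program
comparison = [8, 9, 11, 12, 13, 15]     # did not participate
print(cutting_score(program, comparison))   # 14
```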
Item Analysis vs. Criterion-Referenced Tests
Item analysis:
Too much focus on comparing scores with other students
Too little focus on eliminating specific weaknesses; 40% of
children have been found to repeat the same types of errors.
Criterion-referenced tests:
Certain skills, which are easy to test, are “overly covered”
to meet specific criteria.
Important skills, such as critical thinking, are not focused
upon.
Item Response Theory
Item Response Theory (IRT) – approach to testing based on analysis of a
participant’s chance of correctly answering an item.
In IRT, sample items are given to a participant to identify the specific range
of difficulty that challenges him or her.
The participant is then given questions he or she has about a 50% chance of
answering correctly.
Scores are based on level of difficulty (as opposed to number) of correctly
answered items.
Computer-based administration leads to increased measurement precision
(less error).
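A minimal sketch of this adaptive logic under a one-parameter (Rasch) model; the item bank, the fixed-step ability update, and the ten-item stopping rule are illustrative simplifications, not the maximum-likelihood estimation used in operational computerized adaptive tests:

```python
import math
import random

def p_correct(theta, b):
    """Rasch model: probability of a correct response for ability
    theta on an item of difficulty b; exactly .50 when theta == b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

bank = [b / 2 for b in range(-6, 7)]   # item difficulties -3.0 .. 3.0
true_theta = 0.8                       # participant's (unknown) ability
theta_hat = 0.0                        # start in the middle of the scale

for _ in range(10):
    # Give the unused item closest to the current estimate, i.e. the
    # item the participant has roughly a 50% chance of answering.
    b = min(bank, key=lambda x: abs(x - theta_hat))
    bank.remove(b)
    correct = random.random() < p_correct(true_theta, b)
    # Crude update: move toward harder items after a success.
    theta_hat += 0.5 if correct else -0.5

# The score reflects the difficulty level reached, not a count correct.
print(f"estimated ability: {theta_hat:.2f} (true: {true_theta})")
```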
Selecting Tests
If a test is expensive, time-consuming, or difficult to administer, one must
ask what it might reveal beyond information obtainable in some simpler manner.
Does the test give you more information than you could obtain without it?
If so, how much more information does it give?
Test Administration: Examiner and Participant
Many aspects of the interaction between examiner and participant can have
potential effects on participant performance.
Relationship between examiner and participant
Race of the examiner
Language of participant
Training of examiners
Expectancy Effects
Effects of reinforcing responses
Computer-assisted test administration
Advantages: standardized, can be individually tailored by
sequence, precision of timing responses, control of bias, can accommodate the
pace of the participant
Disadvantages: technical problems, must be constructed
properly, results must be interpreted by an experienced psychologist (not really
disadvantages)
Participant variables
Test anxiety (worry, emotionality, and lack of self-confidence)
Physical illness
Locating Information about Tests
Significant early books
The Principles of Teaching: Based on Psychology (1906)
Clinical Psychiatry (1907)
Choosing a Vocation (1909)
Mental Measurements Yearbook (1938)
Test Critiques (1984)
Dictionary of Occupational Titles (1939)
Diagnostic & Statistical Manual of Mental Disorders (1952)
Standards for Educational & Psychological Tests & Manuals (1966)