Writing, Selecting, and Administering Tests
Psychology of Testing & Measurements
Lecture, Chapters 6 & 7

Item Writing
The nature of a test, its objectives and purposes, dictates what types of questions may be constructed.
Variables of measurement interest must be clearly defined.
Items must be clear and concise, written at a level appropriate for the target population, and free of bias.

Item Formats
    Dichotomous - two alternatives (forced choice)
    Polytomous - more than two alternatives
    Likert - rating scale (strongly agree, etc.)
    Category - on a scale of one to ten
    Checklists and Q-sorts - choose best fit from a long list of adjectives

Dichotomous Item Format
Advantages
    Simple to answer, administer, and score
    Requires absolute judgment

Disadvantages
    Oversimplification
    Memorization without comprehension
    50% chance of guessing correctly

Polytomous Item Format

Importance of good distractors (plausible but incorrect alternatives)

Issue of guessing
Corrected score: C = R - w / (n - 1) = 27 - 3 / (4 - 1) = 26
R = # right responses
w = # wrong responses
n = # choices for each item
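
A minimal sketch of the correction for guessing shown above; the function name is made up, and the example simply reuses the slide's worked numbers (R = 27, w = 3, n = 4):

    def corrected_score(right, wrong, n_choices):
        """Correction for guessing: C = R - w / (n - 1)."""
        return right - wrong / (n_choices - 1)

    print(corrected_score(27, 3, 4))  # 26.0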

Advantages
    Easy to administer and score
    Chance of guessing correctly reduced to 20-25% (for items with four or five alternatives)
    Takes less time and can cover large amounts of material

Likert and Category Formats
Used as part of Likert’s (1932) method of attitude scale construction; most popular format in current measures

Consists of several alternative choices on a continuum for participants to rate themselves on attitude or personality

Number of choices can permit or prevent neutrality: an odd number of choices allows a neutral midpoint, while an even number forces a choice to one side

Category format: Increases # choices to 9 or 10 (beyond that may reduce reliability)
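
A minimal sketch of how Likert-format responses are commonly scored; the items, the 5-point coding, and the reverse-keyed item are hypothetical examples, not drawn from any specific scale:

    # Responses coded 1 = strongly disagree ... 5 = strongly agree (odd number of
    # choices, so a neutral midpoint of 3 is available)
    responses = {"item1": 4, "item2": 2, "item3": 5}
    reverse_keyed = {"item2"}   # negatively worded items are flipped before summing
    MAX_POINT = 5

    scored = {item: (MAX_POINT + 1 - r) if item in reverse_keyed else r
              for item, r in responses.items()}
    scale_score = sum(scored.values())   # total (or mean) across items
    print(scored, scale_score)           # {'item1': 4, 'item2': 4, 'item3': 5} 13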

Effect of Context
The numbers we assign are found to be affected by context (Parducci, 1968).

There is a tendency to spread responses evenly across 10 categories.

How immoral are acts? Students rated the acts in List 1 (mild actions) and List 2 (severe actions); each pair below shows the mean rating when the act appeared in List 1 vs. List 2.
    Bawling out servants publicly (2.64 vs. 2.39)
    Poisoning a neighbor’s dog whose barking bothers you (4.19 vs. 3.65)
    Pocketing the tip the previous customer left the waitress (3.32 vs. 2.46)
    Publishing under your own name an investigation originated and carried out without remuneration by a graduate student working under you (3.95 vs. 3.47)
    Failing to put back in the water lobsters shorter than the legal limit (2.22 vs. 1.82)
    Habitually borrowing small sums of money from friends and failing to return them (2.93 vs. 2.37)

Implications: clearly define the endpoints of the scale, and use extreme caution when comparing responses gathered outside of the current study

Item Analysis

Item analysis = evaluating individual test items through assessments of difficulty and discriminability
Item difficulty - the proportion of participants who answer a particular item correctly (if 40% answer correctly, difficulty = .40); because higher values indicate an easier item, the index really reflects easiness rather than difficulty.
Recommended "difficulty" is halfway between the success rate expected by chance alone and 100% responding correctly; for a four-choice item, chance is .25, so the recommended difficulty is (.25 + 1.00) / 2 = .625.

Item discriminability – assessment of whether the participants who have done well on a particular item have also done well on the whole test
Extreme group method – compares the proportion answering the item correctly among those who did well on the whole test with the proportion among those who did poorly
Point biserial method – correlates performance on an individual item (right/wrong) with the overall test score
A good test item discriminates at all levels: the proportion of participants who answer the item correctly increases as the total test score increases (see the sketch below).
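
A minimal sketch of both statistics for a small scored-response matrix; the data are invented (rows = participants, columns = items, 1 = correct), and the point-biserial is computed here as a simple item-total correlation:

    import numpy as np

    scores = np.array([[1, 1, 0, 1],
                       [1, 0, 0, 0],
                       [1, 1, 1, 1],
                       [0, 0, 0, 1],
                       [1, 1, 1, 0]])
    totals = scores.sum(axis=1)

    # Item difficulty: proportion of participants answering each item correctly
    difficulty = scores.mean(axis=0)

    # Point-biserial discriminability: correlation of each item (0/1) with the total score
    discrimination = [np.corrcoef(scores[:, j], totals)[0, 1]
                      for j in range(scores.shape[1])]

    print(difficulty)                      # [0.8 0.6 0.4 0.6]
    print(np.round(discrimination, 2))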

Items for Criterion-Referenced Tests
Item development is based on learning goals and program objectives, not the performance of peers.

Compare those who have participated in the program with those who have not, and identify a cutting score that separates the two groups.

Use the cutting score as the minimum criterion for meeting the objectives.
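
A minimal sketch of one simple way to place the cutting score; the midpoint-of-means rule and the scores below are illustrative assumptions, not a prescribed procedure:

    # Test scores for people who completed the program vs. people who did not (made-up data)
    completed = [18, 20, 17, 19, 21]
    not_completed = [10, 12, 9, 14, 11]

    def mean(xs):
        return sum(xs) / len(xs)

    # Place the cut halfway between the two group means; scores at or above it
    # are treated as meeting the program objectives
    cutting_score = (mean(completed) + mean(not_completed)) / 2
    print(cutting_score)   # 15.1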

Item Analysis vs. Criterion-Referenced Tests
Item analysis:
    Too much focus on comparing scores with other students
    Too little focus on eliminating specific weaknesses; 40% of children have been found to repeat the same types of errors.

Criterion-referenced tests:
    Certain skills that are easy to test are "overly covered" simply to meet specific criteria.
    Important skills, such as critical thinking, receive too little attention.

Item Response Theory

Item Response Theory (IRT) – approach to testing based on analysis of a participant’s chance of correctly answering an item.
In IRT, sample items are given to a participant, and the range of item difficulty that appropriately challenges the participant is identified.
The participant is then given items that he or she has about a 50% chance of answering correctly.
Scores are based on level of difficulty (as opposed to number) of correctly answered items.
Administration is typically computer based (adaptive testing), which increases measurement precision (reduces error).
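
A minimal sketch of the adaptive logic described above, using a two-parameter logistic item characteristic curve; the ability estimate and item difficulties are made-up values for illustration:

    import math

    def p_correct(ability, difficulty, discrimination=1.0):
        """2PL item characteristic curve: probability of a correct response."""
        return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

    ability_estimate = 0.5                  # current estimate of the participant's ability
    item_difficulties = [-1.0, 0.0, 0.4, 1.5]

    # Pick the item whose probability of success is closest to .50 for this
    # participant, i.e. the difficulty that best matches the ability estimate
    next_item = min(item_difficulties,
                    key=lambda b: abs(p_correct(ability_estimate, b) - 0.5))
    print(next_item)   # 0.4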

Selecting Tests
If a test is expensive, time-consuming, or difficult to administer, one must ask what the test might reveal beyond information that could be obtained in some simpler manner.

Does the test give you more information than you could obtain without it?

If so, how much more information does it give?

Test Administration: Examiner and Participant
Many aspects of the interaction between examiner and participant can have potential effects on participant performance.

    Relationship between examiner and participant

    Race of the examiner

    Language of participant

    Training of examiners

    Expectancy Effects

    Effects of reinforcing responses

Computer-assisted test administration

    Advantages: standardized administration, item sequences that can be individually tailored, precise timing of responses, control of examiner bias, and pacing that accommodates the participant

    Disadvantages: technical problems, the need for proper construction, and the need for results to be interpreted by an experienced psychologist (arguably not true disadvantages)

Participant variables

Test anxiety (worry, emotionality, and lack of self-confidence)

Physical illness

Locating Information about Tests

Significant early books
    The Principles of Teaching: Based on Psychology (1906)
    Clinical Psychiatry (1907)
    Choosing a Vocation (1909)
    Mental Measurements Yearbook (1938)
    Test Critiques (1984)
    Dictionary of Occupational Titles (1939)
    Diagnostic & Statistical Manual of Mental Disorders (1952)
    Standards for Educational & Psychological Tests & Manuals (1966)