Ordinal Data: Mann-Whitney, Wilcoxon, Kruskal-Wallis, and Friedman Tests Week 14 – Chapter 20

Review of Comparing Means
2 group studies:
    Independent measures = independent measures t test
    Repeated measures = repeated measures t test

3+ group studies:
    Independent measures = independent measures ANOVA
    Repeated measures = repeated measures ANOVA

Note: the designs above do not measure relationships (how variables are similar), as correlational studies do; instead, they test for significant differences (how groups differ).

Why use ranks?
Ranks are simpler.

The original scores may violate some of the basic assumptions that underlie certain statistical procedures.

The original scores may have unusually high variance.

Occasionally, an experiment produces an undetermined, or infinite, score.

Statistical Tests for Ordinal Data
The scores are first listed in order, including tied values.

A rank is assigned to each score in the list.

When two or more scores are tied, their assigned ranks are averaged and each score is assigned the average of the tied ranks.
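
The ranking steps above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed procedure: the function name average_ranks and the sample scores are made up for the example, and rank 1 is assumed to go to the smallest score.

```python
def average_ranks(scores):
    """Return the rank of each score (1 = smallest), averaging ranks across ties."""
    # Indices of the scores, listed in order of increasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Find the run of tied scores that starts at this sorted position.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        # Sorted positions pos..end hold ranks pos+1..end+1; each tied score gets their mean.
        mean_rank = (pos + 1 + end + 1) / 2
        for j in range(pos, end + 1):
            ranks[order[j]] = mean_rank
        pos = end + 1
    return ranks


# Hypothetical scores: the three 5s occupy ranks 2, 3, and 4, so each is assigned rank 3.
print(average_ranks([3, 5, 5, 8, 5, 10]))  # [1.0, 3.0, 3.0, 5.0, 3.0, 6.0]
```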

Types of Tests
Mann-Whitney: evaluates differences between 2 treatment conditions or 2 populations; alternative to independent-measures t-test.

Wilcoxon: evaluates differences between 2 treatment conditions from a repeated-measures design; alternative to repeated-measures t-test.

Kruskal-Wallis: evaluates differences among 3+ treatment conditions; alternative to single-factor, independent measures ANOVA.

Friedman: evaluates differences among 3+ treatment conditions from a repeated-measures design; alternative to repeated-measures ANOVA.

The Mann-Whitney test is designed to use the data from two separate samples to evaluate the difference between two treatments (or 2 populations).

The calculations require that individuals be rank-ordered; this can be done either with or without separate scores.

If a real difference exists for 2 treatments, the scores from the samples should be concentrated at opposite ends of the distribution; if no real difference, they will be intermixed.

Mann-Whitney Hypotheses
Ho: There is no difference between the two treatments. Therefore, there is no tendency for the ranks in one treatment condition to be systematically higher or lower than the other.

H1: There is a difference between the two treatments. Therefore, the ranks in one treatment condition are systematically higher or lower than the other.

Mann-Whitney U-Test
For the Mann-Whitney U, the two samples are combined and all the individuals are rank ordered.

Based on the ranks, a U value is computed for each sample
Each individual in Sample A gets one point whenever he or she is ranked ahead of an individual from Sample B.
The total number of points accumulated for Sample A is UA. In the same way UB is also computed.

The Mann-Whitney U is the smaller of the two U values.
You can check your calculations by confirming that
$$U_A + U_B = n_A n_B$$
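
The point-counting procedure can be sketched as follows. This is only an illustration under stated assumptions: "ranked ahead" is taken to mean having the smaller score, no score in Sample A equals a score in Sample B, and the function name and data are hypothetical.

```python
def u_by_counting(sample_a, sample_b):
    """Return (U_A, U_B) by awarding one point per cross-sample comparison won.

    Assumptions (not from the source): "ranked ahead" means a smaller score,
    and no score in A is tied with a score in B.
    """
    u_a = sum(1 for a in sample_a for b in sample_b if a < b)
    u_b = sum(1 for b in sample_b for a in sample_a if b < a)
    return u_a, u_b


# Hypothetical scores, for illustration only.
sample_a = [2, 6, 9, 15]
sample_b = [8, 18, 63]

u_a, u_b = u_by_counting(sample_a, sample_b)
u = min(u_a, u_b)  # the Mann-Whitney U is the smaller value

# Check from the text: the two U values must add up to n_A * n_B.
assert u_a + u_b == len(sample_a) * len(sample_b)
print(u_a, u_b, u)  # 10 2 2
```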

Computing U for Large Samples
Due to the tedium involved in counting points in large samples, a formula can be used.

First, combine the samples and rank-order all individuals (same as previously).

Second, find ∑RA (sum of ranks for individuals in sample A) and ∑RB (sum of ranks for B)

Then, the U value for each sample is computed as follows:
$$U_A = n_A n_B + \frac{n_A(n_A + 1)}{2} - \sum R_A$$
$$U_B = n_A n_B + \frac{n_B(n_B + 1)}{2} - \sum R_B$$

Again, the Mann-Whitney U value is the smaller of these two.
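
The rank-sum route can be sketched in Python as well. This sketch assumes there are no tied scores, so each score's rank is just its position in the combined, sorted list; the function name and data are hypothetical.

```python
def u_from_rank_sums(sample_a, sample_b):
    """Return (U_A, U_B) from the rank-sum formulas.

    Assumption (not from the source): no tied scores, so a score's rank is
    its position in the combined, sorted list (1 = smallest).
    """
    combined = sorted(sample_a + sample_b)
    rank_of = {score: position for position, score in enumerate(combined, start=1)}

    n_a, n_b = len(sample_a), len(sample_b)
    sum_ranks_a = sum(rank_of[x] for x in sample_a)
    sum_ranks_b = sum(rank_of[x] for x in sample_b)

    u_a = n_a * n_b + n_a * (n_a + 1) / 2 - sum_ranks_a
    u_b = n_a * n_b + n_b * (n_b + 1) / 2 - sum_ranks_b
    return u_a, u_b


# Hypothetical scores: ranks are A = 1, 2, 4, 5 (sum 12) and B = 3, 6, 7 (sum 16).
u_a, u_b = u_from_rank_sums([2, 6, 9, 15], [8, 18, 63])
print(u_a, u_b, min(u_a, u_b))  # 10.0 2.0 2.0
```

For this small data set, counting points by hand gives the same two values (10 and 2), which is a useful cross-check on the formula.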

Hypothesis Testing
In testing extremely different treatments, the ranks will be clustered at different ends of the scale, and the Mann-Whitney U will be zero because one of the samples gets no points at all.

If treatments produce the same results, ranks will be interspersed, and the U value will be large.

Thus, the smaller the U, the more likely the result is to be statistically significant.

Use Table B.9 to determine the critical value for U; remember that the calculated U must be less than or equal to the critical value to reject the null.

College Student Grief Study: Mann-Whitney U Test
Let’s say, hypothetically, that 2 samples were drawn, one in the fall and one in the spring. We want to know if the samples were drawn from the same population and have relatively similar GPAs. The Mann-Whitney U test will rank order the combined samples in terms of GPA, then test whether the ranks differ systematically between the two samples.

H0: There will be no difference in GPA between the two samples.

H1: There will be differences in GPA between the two samples.

Reporting Results
The GPAs of the participants were rank ordered, and a Mann-Whitney U-test was used to compare the ranks for the n = 199 participants measured in the fall and the n = 182 measured in the spring. The results indicate no significant difference between semesters, U = 17118.5, p > .05, with the sum of the ranks equal to 37,018.5 in the fall and 35,752.5 in the spring.
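
As a check on the reported values, the large-sample formulas can be applied to the two rank sums. The subscripts "fall" and "spring" are labels chosen here for the two samples; every number comes from the report above.

```latex
\begin{align*}
U_{\text{fall}}   &= (199)(182) + \frac{199(200)}{2} - 37018.5 = 36218 + 19900 - 37018.5 = 19099.5 \\
U_{\text{spring}} &= (199)(182) + \frac{182(183)}{2} - 35752.5 = 36218 + 16653 - 35752.5 = 17118.5 \\
U_{\text{fall}} + U_{\text{spring}} &= 36218 = (199)(182)
\end{align*}
```

The smaller value, 17118.5, matches the reported U, and the two values sum to the product of the sample sizes, as required.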

Wilcoxon Test

Alternative to repeated-measures t-test

The Wilcoxon is designed to evaluate the difference between two treatments using data from a repeated-measures study (each participant measured twice); differences between 2 scores are evaluated.

These differences must be ranked from smallest to largest in terms of absolute values.

You can also simply rank amount of change in participants (without regard to direction)

H0: There is no difference between the two treatments. Therefore, in the general population there is no tendency for the difference scores to be either systematically positive or systematically negative.

H1: There is a difference between the two treatments. Therefore, in the general population the difference scores are systematically positive or systematically negative.

Wilcoxon Signed-Ranks Test
Again, the difference between the two treatments is recorded for each individual, and the absolute values of the difference scores are rank ordered.

After ranking the absolute values of the difference scores, separate the ranks into 2 groups: those with positive differences and those with negative differences.

Then, compute the sum of the ranks for the positive differences (increases) and for the negative differences (decreases).

The Wilcoxon T is the smaller of the two sums.
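
The signed-ranks steps can be sketched as follows. This is a simplified illustration that assumes no zero differences and no tied absolute differences; the function name and the before/after scores are hypothetical.

```python
def wilcoxon_t(before, after):
    """Return (T, sum of positive ranks, sum of negative ranks).

    Assumptions (not from the source): no difference score is zero and no two
    difference scores share the same absolute value.
    """
    diffs = [a - b for b, a in zip(before, after)]
    # Rank the differences by absolute value, smallest first (rank 1).
    ordered = sorted(diffs, key=abs)
    sum_pos = sum(rank for rank, d in enumerate(ordered, start=1) if d > 0)
    sum_neg = sum(rank for rank, d in enumerate(ordered, start=1) if d < 0)
    return min(sum_pos, sum_neg), sum_pos, sum_neg


# Hypothetical repeated-measures scores for 5 participants.
before = [10, 12, 9, 15, 11]
after = [13, 11, 14, 22, 9]

# Differences are 3, -1, 5, 7, -2, so the negative differences hold ranks 1 and 2.
t, sum_pos, sum_neg = wilcoxon_t(before, after)
print(t, sum_pos, sum_neg)  # 3 12 3
```

A quick sanity check is that the two sums add up to n(n + 1)/2, here 15.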

Interpretation of the T
A systematic difference between treatments will cause the difference scores to be consistently positive (or consistently negative) which will produce a small value for T. (For instance, when all differences are positive the sum of the negative ranks is zero.)

Thus, the smaller the T, the more likely the result is to be statistically significant.

Use Table B.10 to determine the critical value for T; remember that the calculated T must be less than or equal to the critical value to reject the null.

For large samples, the normal approximation is used.
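
The slides do not spell the approximation out. The standard large-sample form is shown here as a reference sketch: under the null hypothesis, T is treated as approximately normal with the mean and standard deviation below, where n is the number of ranked difference scores.

```latex
\begin{align*}
\mu_T    &= \frac{n(n + 1)}{4} \\
\sigma_T &= \sqrt{\frac{n(n + 1)(2n + 1)}{24}} \\
z        &= \frac{T - \mu_T}{\sigma_T}
\end{align*}
```

The resulting z is then compared with the usual critical values of the unit normal distribution (for example, ±1.96 for a two-tailed test at .05).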

Tied Scores
The Wilcoxon test assumes a continuous DV (dependent variable).

Thus, tied scores should be unlikely; if ties are frequent, reconsider whether the Wilcoxon is appropriate for the data. Ties sometimes do occur when:
        Participant has same score for both measures (difference = 0)
        2+ participants have identical difference scores (ignoring sign of the differences)

Option 1: discard tied scores, reducing sample size (n)

Option 2: divide zero differences evenly between positives and negatives (if odd #, discard 1), and assign them the average of the tied ranks.
        More conservative; more difficult to reject null
        Preferred way!!!
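
Option 2 can be sketched as follows. This is one possible reading of the rule, not a definitive implementation: zero differences are ranked together with the other absolute differences, tied ranks are averaged, and the rank total of the zeros is split evenly between the positive and negative sums. All names and data are hypothetical.

```python
def average_ranks(values):
    """Rank values (1 = smallest), giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for j in range(pos, end + 1):
            ranks[order[j]] = (pos + end + 2) / 2  # mean of ranks pos+1..end+1
        pos = end + 1
    return ranks


def wilcoxon_t_option2(diffs):
    """Return (T, positive sum, negative sum), splitting zero differences evenly."""
    diffs = list(diffs)
    # With an odd number of zero differences, discard one of them.
    if diffs.count(0) % 2 == 1:
        diffs.remove(0)
    ranks = average_ranks([abs(d) for d in diffs])
    sum_pos = sum(r for r, d in zip(ranks, diffs) if d > 0)
    sum_neg = sum(r for r, d in zip(ranks, diffs) if d < 0)
    # All zeros share one averaged rank, so half of them go to each sum.
    zero_total = sum(r for r, d in zip(ranks, diffs) if d == 0)
    sum_pos += zero_total / 2
    sum_neg += zero_total / 2
    return min(sum_pos, sum_neg), sum_pos, sum_neg


# Hypothetical difference scores: three zeros (one is discarded) and a tie at |3|.
print(wilcoxon_t_option2([0, 0, 3, -3, 5, 0]))  # (5.0, 10.0, 5.0)
```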

Kruskal-Wallis

The Kruskal-Wallis is used to evaluate differences between 3+ treatments using independent measures

This test is a nonparametric alternative to the single-factor ANOVA.

Similar to Mann-Whitney but involves more than 2 treatments.

Kruskal-Wallis H
First, obtain ordinal measurement by doing the following:

The 3+ separate samples are combined, and the entire group is rank ordered.

After being ranked, individuals (with assigned ranks) are separated back into their original samples or groups.

The score for each participant becomes the ordinal rank that was obtained in the last step.

Note that you can assign ranks based on scores from measurements or you can begin with rankings only.

Table 20.2 (p. 656)

Hypotheses
H0: There is no tendency for the ranks in any treatment condition to be systematically higher or lower than the ranks in any other treatment condition. There are no differences between treatments.

H1: The ranks in at least one treatment condition are systematically higher (or lower) than the ranks in another treatment condition. There are differences between treatments.

Calculating Kruskal-Wallis
Once rank ordered, the following are calculated:
    The ranks in each treatment are added to obtain a total or T value for that treatment condition.
    The number of participants in each treatment condition is identified by a lower-case n.
    The total number of participants in the entire study is identified by uppercase N.


Table 20.3 (p. 657)

The value of H is computed:
$$H = \frac{12}{N(N + 1)} \sum \frac{T^2}{n} - 3(N + 1)$$

The value of H is evaluated as a chi-square statistic with degrees of freedom equal to the number of samples minus one (k – 1).

Then we use Table B.8 (The Chi-Square Distribution) to determine if the calculated H is significant. For the data in this example, k – 1 = 2, and the critical value for 2 df at .05 is 5.99. Thus, our calculated H must be greater than or equal to 5.99 to reject the null.
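
The calculation of H can be sketched in Python. The sketch assumes there are no tied scores, so each score's rank is its position in the combined, sorted list; the function name and the three groups of scores are hypothetical and are not the values from the textbook tables.

```python
def kruskal_wallis_h(groups):
    """Return (H, list of rank totals T) for a list of independent samples.

    Assumption (not from the source): no tied scores anywhere in the data.
    """
    combined = sorted(score for group in groups for score in group)
    rank_of = {score: position for position, score in enumerate(combined, start=1)}

    big_n = len(combined)  # N: total number of participants
    totals = [sum(rank_of[score] for score in group) for group in groups]  # T per group
    # Sum of T squared over n, one term per treatment condition.
    sum_t2_over_n = sum(t ** 2 / len(group) for t, group in zip(totals, groups))

    h = 12 / (big_n * (big_n + 1)) * sum_t2_over_n - 3 * (big_n + 1)
    return h, totals


# Hypothetical scores for k = 3 treatments with n = 3 participants each.
groups = [[14, 3, 21], [6, 31, 9], [28, 40, 36]]
h, totals = kruskal_wallis_h(groups)
print(totals, round(h, 3))  # [10, 12, 23] 4.356
```

With df = k – 1 = 2, this H is below 5.99, so these made-up data would not lead to rejecting the null.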

College Student Grief Study: Kruskal-Wallis H Test
Let's say that 4 samples are drawn over the course of the study. We want to know whether the participants’ levels of depression differed among the 4 samples.

H0: There is no difference in level of depression (Beck’s Depression Inventory ranged 0 – 63; the higher the score, the greater the depression) among the 4 samples.

H1: There is a significant difference in level of depression among the 4 samples.

Reporting the Results
After ranking the individual scores, a Kruskal-Wallis test was used to evaluate differences among the four samples. The outcome of the test indicated no significant differences among the samples, H = 1.039 (3, N = 381), p > .05.

Friedman Test

The Friedman test is used to evaluate the differences between 3+ treatment conditions using data from repeated-measures designs.

Alternative to repeated-measures ANOVA.

Similar to Wilcoxon but utilizes more than 2 conditions.

Hypotheses
H0: There is no difference between treatments. Thus, the ranks in one treatment condition should not be systematically higher or lower than the ranks in any other treatment condition.

H1: There are differences between treatments. Thus, the ranks in at least one treatment condition should be systematically higher or lower than the ranks in another treatment condition.

Friedman Test
First, each individual’s performance is ranked across the different treatments (ranks will not exceed # treatments).

Note: Friedman can be used with scores on objective measurements, as long as they are converted to ranks.

The sum of the ranks is computed for each treatment (R1, R2…). If these are near the same values, the null is supported; if not, the alternative is supported.

Table 20.4 (p. 660)

Calculation of Friedman
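A chi-square value is computed using the following formula, where n is the number of participants, k is the number of treatments, and R is the sum of the ranks for a treatment:

$$\chi_r^2 = \frac{12}{nk(k + 1)} \sum R^2 - 3n(k + 1)$$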

Note that the statistic is identified as chi-square (Χ²) with a subscript r, and corresponds to a chi-square statistic for ranks.

The chi-square statistic has degrees of freedom equal to the number of treatments minus one (k – 1) and is evaluated using the critical values in the chi-square distribution (Table B.8).

With df = k – 1 = 2, the critical value of chi-square is 5.99, so the calculated Χr² must be greater than or equal to 5.99 to reject the null.
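
The Friedman calculation can be sketched in Python. The sketch assumes no participant has tied scores across treatments; the function name and the score table are hypothetical and are not the values from Table 20.4.

```python
def friedman_chi_square(table):
    """Return (chi-square for ranks, list of rank sums R) for repeated-measures data.

    table holds one row per participant and one column per treatment.
    Assumption (not from the source): no tied scores within a row.
    """
    n = len(table)     # number of participants
    k = len(table[0])  # number of treatments
    rank_sums = [0] * k
    for row in table:
        # Rank this participant's scores across the k treatments (1 = lowest).
        order = sorted(range(k), key=lambda j: row[j])
        for rank, j in enumerate(order, start=1):
            rank_sums[j] += rank

    sum_r2 = sum(r ** 2 for r in rank_sums)
    chi_r2 = 12 / (n * k * (k + 1)) * sum_r2 - 3 * n * (k + 1)
    return chi_r2, rank_sums


# Hypothetical scores: 4 participants, each measured in 3 treatments.
table = [
    [10, 14, 19],
    [8, 15, 12],
    [11, 13, 20],
    [9, 7, 16],
]
chi_r2, rank_sums = friedman_chi_square(table)
print(rank_sums, chi_r2)  # [5, 8, 11] 4.5
```

For these made-up data the statistic is 4.5, which falls short of 5.99, so the null would not be rejected.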