[eDebate] Mutual Preference at Kentucky
Thu Sep 7 12:02:04 CDT 2006
As noted by JW, we are going to experiment with a new way of
establishing the "mutual" part of "mutual preference" at the Kentucky tournament.
If the only issue were maximizing preference, any system of ranking
judges would be simple to implement. We could have any number of
categories, ordinal ranking, judge ratings ... Our goal would be to get
the best judge possible for both teams but we wouldn't worry whether the
outcome was mutual.
The real issue is to measure and establish an appropriate level of
mutuality. Each system that has been used - circles and strikes, ABCX,
A+AB+BCX, nine categories rated 1-9, or ordinal ranking - possesses some
advantages but also some real weaknesses in establishing mutuality.
If we have a small number of categories with a large number of judges
in each, it is relatively easy to have nearly all rounds be exact
matches, but of course what it means to be an "exact" match becomes less
meaningful since the "within-category" difference can be quite large.
As we increase the number of categories and decrease the number of
judges within each category, it becomes more difficult to have "exact"
matches and also more difficult to pair rounds with the highest
available category. It is arguable that the final outcome is still
superior in terms of absolute preference and absolute mutuality, but the
community continues to debate the relative merits of using 4, 6, or 9
categories. Regardless of the number of categories, any category-based
system suffers from the incongruity between in-category and
between-category differences, and from the fact that the tournament
mandate to have a minimum quota of judges in each category fails to
reflect genuine differences between teams regarding how many judges they
would genuinely define as A's or B's or whatever.
An alternative that was tested briefly about five years ago was to use
ordinal rankings rather than categories where it can be "proven" that
you can better maximize and measure both mutuality and preference in
continuous rather than categorical terms. While the experiment did
result in "improved" outcomes, most teams found the ranking challenge to
be more work than they desired. Ordinal rankings also suffer from the
"myth" that the qualitative difference between any two ranks is
proportional to the numeric difference between those ranks. In other
words, is the difference between my 5th and my 10th judge identical to
the difference between my 25th and my 30th or between my 50th and my
55th? This continuous linear assumption violates the intuition that
judge ratings cluster together (though not in the artificial groups
created by categorical systems) and that they might be arrayed in a bell
curve.
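To make the clustering objection concrete, here is a small sketch (the judge names and ratings are invented for illustration) of how equal rank gaps can conceal very unequal quality gaps:

```python
# Hypothetical 0-100 ratings (invented for illustration); rank 1 = most preferred.
ratings = {"A": 92, "B": 91, "C": 90, "D": 70, "E": 45, "F": 44}
ranked = sorted(ratings, key=ratings.get, reverse=True)
rank = {judge: i + 1 for i, judge in enumerate(ranked)}

# Judges A and C are two ranks apart, and so are C and E -- but the rating
# gap A->C is 2 points while C->E is 45 points. Equal rank differences do
# not imply equal quality differences when ratings cluster.
gap_ranks = (rank["C"] - rank["A"], rank["E"] - rank["C"])                # (2, 2)
gap_ratings = (ratings["A"] - ratings["C"], ratings["C"] - ratings["E"])  # (2, 45)
```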
As a result, the experiment is twofold. We are going to permit teams
to assign ratings 0-100 to all judges in the pool with NO arbitrary
limitations on what values get assigned (though I will recommend that
folks aim for a mean of 50). Taken by itself, this rating task will
help address the issue of commensurability. Are there real differences
between the ways that various teams evaluate the available judging pool?
All of our systems to date force a version of commensurability in order
to manage mutuality.
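As a rough illustration of what an incommensurability check might look like (the team names and ratings here are invented, not tournament data), compare the summary statistics of two teams' raw ratings:

```python
import statistics

# Invented ratings over the same pool: Team X spreads its ratings out,
# while Team Y clusters them near the top of the scale.
team_x = [95, 80, 65, 50, 35, 20, 5]
team_y = [90, 88, 85, 83, 80, 78, 75]

for name, scores in [("Team X", team_x), ("Team Y", team_y)]:
    print(f"{name}: mean={statistics.mean(scores):.1f}, "
          f"stdev={statistics.pstdev(scores):.1f}")

# A raw rating of 80 is near the top of Team X's scale but only
# middle-of-the-pack for Team Y: the raw scales are not commensurable.
```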
The second part is more ambitious. To the extent that we discover that
there ARE significant incommensurabilities, what is the best strategy
for imposing mutuality? The data that is collected can be translated
into equivalent 4-category, 6-category, 9-category, or ordinal
equivalents so that we could see how each system would function. But
for the purpose of the tournament, we will test another alternative.
The statistical procedure that we use for speaker point tie breakers,
z-scores, can be used to transform all of the various distributions into
more commensurable "normalized" distributions that can be used for
mutuality. So the mutuality judgment for each judge assignment will be
based on the z-score, the number of standard deviations above or below
the sample means, for each of the two teams. But at the same time, I
will be able to report the resulting mutuality based on how it would
have been computed in each of the competing categorical or ordinal
systems. So while the technology might make the outcome seem more
opaque, we will be very open in post-tournament reporting of results.
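A minimal sketch of the z-score idea (judge names, ratings, and the gap-based mutuality measure are my own illustration, not the tournament software): each team's raw ratings are standardized against that team's own mean and standard deviation, and a judge's mutuality is read off from how close the two teams' z-scores are:

```python
import statistics

def z_scores(ratings):
    """Standardize one team's ratings against its own mean and stdev."""
    mean = statistics.mean(ratings.values())
    stdev = statistics.pstdev(ratings.values())
    return {judge: (r - mean) / stdev for judge, r in ratings.items()}

# Invented 0-100 ratings from two teams over the same judging pool.
team_a = {"Jones": 90, "Smith": 70, "Lee": 50, "Patel": 30, "Cho": 10}
team_b = {"Jones": 75, "Smith": 72, "Lee": 68, "Patel": 40, "Cho": 20}

za, zb = z_scores(team_a), z_scores(team_b)

# One plausible mutuality measure: the gap between the teams' z-scores.
# A small gap means the judge sits at a similar point in both teams'
# normalized distributions, even if the raw ratings differ.
best = min(team_a, key=lambda j: abs(za[j] - zb[j]))
for judge in team_a:
    print(f"{judge}: z_A={za[judge]:+.2f}, z_B={zb[judge]:+.2f}, "
          f"gap={abs(za[judge] - zb[judge]):.2f}")
print("Most mutual assignment:", best)
```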
In attempting this experiment, I do have one strong recommendation. In
all of the mutual preference schemes used to date, some teams conclude
that there must be some way to "game" the system to obtain better
preferences for themselves, or worse preferences for their opponents,
than an honest ranking would produce. So teams rank judges they don't
actually want as A's, thinking that no one else will prefer those judges
at all, thereby allowing them to concentrate their real A's or use
mutuality as a means of increasing their strikes. Someone will be
tempted to conclude that
there must be some way to create ratings for judges that will accomplish
the same kind of outcome. While I seriously doubt that you will
succeed, my real request is that you don't even try. The initial and
perhaps most important result of the experiment is NOT the use of
z-scores for mutuality but rather getting an absolutely honest
distribution of ratings for each team to test our assumptions about
commensurability. If your ratings don't reflect your genuine evaluation
of each of the judges, that foundational objective won't be met.
As a side note, IF this experiment is successful, the collection of
judge ratings is something that can be tournament-independent, since
teams wouldn't be required to re-rate judges at each tournament to meet
the quotas that those tournaments impose. In fact, even if tournaments chose to
use categorical or ordinal systems, the data "could" be directly mapped
from the ratings that would be in the database (editable whenever teams
chose to revise their data).
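One way such a mapping could work (this is my own sketch; the quota labels and fractions are invented, not a real tournament's): sort the pool by stored rating and slice it into the tournament's required category percentages.

```python
def to_categories(ratings, quotas):
    """Map 0-100 ratings into labeled categories.

    quotas: list of (label, fraction) pairs, fractions summing to 1,
    ordered from most to least preferred.
    """
    ranked = sorted(ratings, key=ratings.get, reverse=True)
    out, start = {}, 0
    for label, fraction in quotas:
        count = round(fraction * len(ranked))
        for judge in ranked[start:start + count]:
            out[judge] = label
        start += count
    # Any judges left over by rounding fall into the last category.
    for judge in ranked[start:]:
        out[judge] = quotas[-1][0]
    return out

# Invented pool of eight judges mapped into a 4-category ABCX scheme.
pool = {f"J{i}": 100 - 10 * i for i in range(8)}
cats = to_categories(pool, [("A", 0.25), ("B", 0.25), ("C", 0.25), ("X", 0.25)])
```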