[eDebate] Mutual preference experiment - what next
Wed Nov 22 17:04:39 CST 2006
At this point I welcome both open discussion and back-channel messages
regarding where we go next. Those who were less than thrilled with
their personal results at Wake and/or Kentucky have expressed fear that
the system would be imposed at Northwestern or CEDA Nats. Others have
expressed excitement about the possibility that using rating data might
facilitate creating a web-based database of "permanent" judge ratings
that could be re-used or adapted from week to week by tournaments that
might even use different judge assignment systems or algorithms.
I need to make it clear that the experiment thus far has been just
that. It isn't yet and perhaps never will be a "new system" that I will
try to sell to the community. In that regard it should be noted that
neither the use of ordinals at CEDA Nats a few years back nor the move
to nine categories at tournaments I run were imposed on the community by
me. I genuinely believe we need to have an extended dialogue about how
category systems should be designed if they are used, whether ordinal
systems are preferable, and finally whether ratings systems can be
"tweaked" with sufficient constraints to ensure that they can work
fairly for all teams but not so many constraints that they pose an even
more difficult or less satisfactory ratings task than categorical or
ordinal systems would impose.
While the aggregate data at Wake and Kentucky were very encouraging, a
closer look suggests that the "rush to the bottom" creates inevitable
problems that would require some form of external mandate.
The presence of bi-modal distributions doesn't break the system but it
does raise questions regarding the usefulness of z-scores to measure
mutuality. In fact, at Wake ordinal difference was introduced as an
additional test on mutuality to prevent anomalous outcomes (given the
fact that a few bimodal and skewed teams had a mean that was less than
their stdev). While it would probably all recalibrate over time just as
other systems have, I suspect that the complaints in the meantime would
militate against a move to unconstrained ratings.
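To make the z-score concern concrete, here is a rough sketch in Python.
It is not the code used at Wake or Kentucky. The 0-100 scale, the
"higher is better" direction, the use of the population standard
deviation, and the names z_scores, mutuality_gap and mean_below_stdev
are all assumptions made for illustration.

```python
from statistics import mean, pstdev

def z_scores(ratings):
    """Return judge -> z-score within one team's own rating sheet."""
    mu = mean(ratings.values())
    sigma = pstdev(ratings.values())  # population stdev (an assumption)
    if sigma == 0:
        # Every judge rated the same: no preference information at all.
        return {judge: 0.0 for judge in ratings}
    return {judge: (r - mu) / sigma for judge, r in ratings.items()}

def mutuality_gap(team_a, team_b, judge):
    """Absolute difference between two teams' z-scores for one judge."""
    return abs(z_scores(team_a)[judge] - z_scores(team_b)[judge])

def mean_below_stdev(ratings):
    """Flag the anomaly noted above: a sheet whose mean is below its stdev."""
    return mean(ratings.values()) < pstdev(ratings.values())

# Hypothetical sheets: one evenly spread, one that "rushes to the bottom".
spread = {"J1": 90, "J2": 70, "J3": 50, "J4": 30}
bottom = {"J1": 100, "J2": 0, "J3": 0, "J4": 0}

print(mean_below_stdev(spread))  # False
print(mean_below_stdev(bottom))  # True
print(mutuality_gap(spread, bottom, "J2"))
```

In the second sheet three judges share an identical raw rating, yet
their z-scores are compared against a team that separated those same
judges, which is why a second check such as ordinal difference can be
useful alongside the z-score test.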
So what should we do? I think that several avenues warrant
consideration. First, an ordinal system where we permit ties
(coded either at the top or potentially the middle of the range) could
have significant merit. It is still the case from a
statistical point of view that an ordinal system permits the best
maximization of mutuality and preference. Second, it is possible that a
minimal set of constraints could "tweak" a ratings system to preserve
its strengths while minimizing its weaknesses. Third, given that the
most popular tournament management packages rely on categories and given
the fact that we have years of experience using them, it may be possible
to design a category system that addresses the issues that have been
identified. For instance, it might be useful to adopt a formula that
essentially divides the judging pool by 10 to determine the number of
categories (e.g. 16 at Wake) and then assign 10 per category. In any
case, Will's concern that we create a bright line for strikes at an
appropriate point in the distribution is well taken.
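The category formula suggested above can be sketched as follows. This
is only an illustration: rounding up when the pool does not divide
evenly, and the names category_count and categories_from_ranking, are
my assumptions. The pool of 160 is used only because it reproduces the
16 categories mentioned for Wake; it is not a reported figure.

```python
import math

def category_count(pool_size, per_category=10):
    """Number of categories when each holds `per_category` judges.

    Rounding up for a pool that is not a multiple of `per_category`
    is an assumption; the last category is then simply short.
    """
    return math.ceil(pool_size / per_category)

def categories_from_ranking(ranked_judges, per_category=10):
    """Map each judge to a category, given one team's best-to-worst list.

    The first `per_category` judges land in category 1, the next
    `per_category` in category 2, and so on.
    """
    return {
        judge: index // per_category + 1
        for index, judge in enumerate(ranked_judges)
    }

# Hypothetical pool of 160 judges, ranked best to worst by one team.
pool = [f"Judge{n:03d}" for n in range(1, 161)]
print(category_count(len(pool)))  # 16
cats = categories_from_ranking(pool)
print(cats["Judge001"], cats["Judge011"], cats["Judge160"])  # 1 2 16
```

A bright line for strikes would then just be a chosen category number,
with everything at or below it treated as a strike.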
Within the discussion about systems we need to re-open two fundamental
questions (assuming that we remain committed to MPJ for some tournaments
- a whole different topic). First, we need to discover honest and
accurate preference data that is independent of the inevitable strategic
considerations of a given tournament.
In that regard, I would be interested in finding out whether folks
would be willing to modify the pref data I have from Wake for the
purpose of additional testing. Teams would start with the rating data
they already provided and just modify it to represent what they think
absent any strategy (since the tournament is already over, their
judgments wouldn't hurt them).
Second, I think that we need to frankly discuss how we prioritize
mutuality and preference as the coding systems and algorithms become
ever more powerful. What counted as a consensus in ABCX might no longer
hold. And if we went in the direction of ordinal, where would the
trade-off points occur?
Thanks for the opportunity to work with the community in the continuing
search for the Holy Grail. I trust that innovation will make the
activity better for all of its participants (or that we discover the
info necessary to scrap our mistakes).