[eDebate] Mutual Preference experiment - part 2

Gary Larson Gary.N.Larson
Wed Nov 22 10:59:40 CST 2006

The current experiment grew out of discussions surrounding last year's
NDT where questions were asked about the number of categories (6 vs. 9)
and the appropriate quotas for each category.  With respect to the
latter question, should the categories be equal-sized, top-loaded (ii.e.
require more judges in A+, A than other categories) or middle weighted
(e.g. 10%, 15%, 25%, 25%, 15%, 10%).  Given a potentially infinite array
of category systems and the need to test them with real data but outside
of the NDT tournament experience, NDT participants were asked to supply
ordinal rankings to all of the judges at the tournament so that we could
run simulations of a wide variety of categorical alternatives to
determine the best outcome, as well as to evaluate the possibility of
using the ordinal data directly to do the judge assignments.
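To make the quota question concrete, here is a small sketch of how a percentage-based scheme translates into per-category judge counts. The function name and the rounding rule are my own illustrative choices, not part of any actual tab software.

```python
# Illustrative sketch (not actual tab software): turning a percentage-based
# quota scheme into per-category judge counts for a pool of a given size.

def quota_counts(pool_size, percentages):
    """Convert category percentages into judge quotas; any rounding
    remainder is absorbed by the last (lowest) category."""
    counts = [round(pool_size * p / 100) for p in percentages]
    counts[-1] += pool_size - sum(counts)  # make the quotas sum to the pool
    return counts

# The middle-weighted 6-category scheme mentioned above (10/15/25/25/15/10)
# applied to a 159-judge pool:
middle_weighted = quota_counts(159, [10, 15, 25, 25, 15, 10])
```

Any such scheme forces a fixed shape onto every team's sheet regardless of how that team's preferences actually cluster, which is exactly the concern raised below.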

Unfortunately, a large percentage of tournament participants never
supplied the data to make the simulations possible.  Additionally, some
argued that translating ordinal data into categories wouldn't reflect
how they would assign judges to categories since they had a number of
strategies that they believed improved their outcomes by assigning
judges to a different category than would be predicted by the ordinal
data and/or by assigning more judges than required to high categories.

This raised several fundamental questions.  1) Do categories by
their very nature skew pref and mutuality because of the way they treat
in-category vs. between-category differences?  2)  Do category systems
inherently fail because the imposed structure of the quotas fails to
match the natural clustering of judge ratings on any given team's sheets
(and perhaps more egregiously for certain classes of teams)?  3)  Do
ordinal systems inherently fail because they don't reflect clustering at
all, instead treating the actual difference in preference as
proportional to the ordinal difference?

To answer any of these questions, the fundamental necessity was to
obtain actual preference information without any externally-imposed
constraints.  Other than anecdotes, we really don't know how well our
categorical systems encode the actual preferences of tournament
participants.   Given the fact that the data had to be REAL and had to
be linked to an actual judging pool and actual tournament pairings, two
possibilities presented themselves.  The option tried after the NDT was to
obtain the data post facto.  In addition to the fact that the post facto
option didn't work due to low return rates on the ordinal ranking task, it
also raised the prospect that the results would be tainted by the
actual outcomes of the tournament (e.g. I choose to rate judges lower
after the tournament because they voted against me at the tournament).

This dilemma led to the next evolutionary step, namely the use of the
rating data directly in a tournament setting by using z-scores to encode
mutuality of otherwise incommensurable rating distributions.  After
running hundreds of hypothetical tournament rounds, posting the
possibility to edebate for discussion in late spring, and discussing
it with Fritch and others, I proposed to JW that we run the experiment
at Kentucky.
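The z-score idea can be sketched minimally. The function names and sample ratings below are my own hypothetical illustration of the standardization step, not the actual pairing code.

```python
# Minimal sketch (hypothetical, not the actual pairing code) of using
# z-scores to make two teams' otherwise incommensurable rating scales
# comparable when measuring mutuality.
from statistics import mean, pstdev

def z_scores(ratings):
    """Standardize one team's ratings to mean 0, population stdev 1."""
    mu, sigma = mean(ratings), pstdev(ratings)
    return [(r - mu) / sigma for r in ratings]

def mutuality_gap(z_a, z_b, judge):
    """Smaller gap between the teams' z-scores = a more mutual judge."""
    return abs(z_a[judge] - z_b[judge])

# Team A spreads its ratings across 0-100; Team B clusters everything
# between 40 and 60.  After standardizing, the two sheets can still be
# compared judge-by-judge on a common scale.
a = z_scores([90, 80, 70, 50, 20])
b = z_scores([60, 58, 55, 45, 40])
most_mutual = min(range(5), key=lambda j: mutuality_gap(a, b, j))
```

The point of the standardization is that a "55" from a team that clusters its ratings means something very different from a "55" from a team that uses the whole scale; z-scores express each rating relative to that team's own distribution.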

Crunching a wide array of statistics after the tournament, it appeared
that in terms of overall outcome, the alternative system did at least as
well in maximizing preference and mutuality as the 9-category system I
had run at previous Kentucky tournaments.
Anecdotal responses were mixed, but as I noted before, most focused on
the placement of a single judge that they wished they hadn't
received.  Others suggested that their principal concern was that they
hadn't yet figured out how to fill out their sheet appropriately.  In
that context, Ross approved using the system once again at Wake.

Before presenting the data in the next post, one question that requires
a response was posed by a coach whose team received a critic they were
unhappy with in the 8th round at Kentucky.  The coach asked whether it
was ethically appropriate to experiment with the tournament format or
judge assignment system at an important tournament.  Does it potentially
impact the final outcome and, as a result, the lives of particularly
senior students?  I've spent a lot of time thinking about the charge and
believe that it ultimately is appropriate to experiment with formats and
strategies at regular-season tournaments, provided that adequate
testing has preceded the experiment and that the testing demonstrated
that the aggregate preference results and the internal fairness of those
results weren't adversely affected.  But what about the power of the
anecdote in the life of the student affected?  Unfortunately, we have no
way of verifying whether the individual outcome would have
happened in another system anyway.  If I count the total number of rounds
at, above, or below the break where the pref was NOT ideal, it certainly
happened, but it happened no more often than in previous years.  But I
am open to the community's critique in this matter. 

So, what hypotheses did we have going into the experiment?

1)  The distribution of actual preferences would not cluster into our
typical quota-driven category schemes.
2)  The distribution of actual preferences would, for most teams, be
distributed "normally," consistent with other rating tasks that we perform.
3)  The translation of preferences into ordinal equivalents would
provide superior outcomes to translating them into categorical
equivalents.
4)  The use of z-scores (by themselves or in combination with other
available measures) would provide equal or better mutuality even
though the raw data weren't constrained.


If this was social science research rather than an operational attempt
to discover ever better ways to run our tournaments, I would have to
immediately admit that the data that I received was flawed.  If the goal
was to obtain the "real" preference data and to use it to evaluate the
overall success and fairness of various algorithms or coding schemes, I
largely failed.

At the end of the day for a large number of teams, competition trumps
nearly all other considerations.  The rating of judges, regardless of
the system used, is part of the competitive exercise rather than a
disconnected evaluation of competence, fairness, or some other judge
characteristic.  In social science terms, the experiment became a
classic case where collecting the data changes the data.  For a large
number of teams, the rating task created additional angst by raising the
specter that other teams had been able to figure out a magic strategy to
maximize their preferences that they somehow hadn't envisioned.  At the
same time, folks hypothesized regarding how they could code the judges
to maximize their own outcomes.  The operational definition of
"fairness" becomes "I don't want to be disadvantaged."  While I agree
with the sentiment, it doesn't necessarily mean that I would be unhappy
if I could create an outcome that produces an "advantage" for me.  While
it probably only affected a minority of teams in extremis, Will is right
that teams attempted to discover the boundaries of advantage (or of
avoiding disadvantage) and thereby adopted a variety of creative
strategies.
While I'll let everyone examine the data, the simple conclusion is that
a rating system with NO external constraints will not be a practicable
alternative for judge assignment.  When a few teams try creative
strategies to "game" the system, some might appear to succeed while
others might appear to fail.  While it's arguable that it will all wash
out in the end so that essentially everyone will move to a more centrist
rating strategy (just as folks have experimented with a variety of
creative approaches to categories but generally fall back to a norm), in
the meantime some folks will be unhappy that certain teams seemed to be
rewarded for their game (presumably at everyone else's expense), others
will be unhappy whenever their creative strategy fails, and perhaps all
of us will be unhappy when more teams try more creative strategies to
the point where it necessarily breaks the system (if only temporarily).

Will's discussion of the lack of formal "strikes" illustrates the
problem.  On face, Will's concern is that in a small number of
instances, teams received a judge that they had rated a "0".  Obviously
if the system could/would assign a 0 it meant that it wasn't
interpreting any rating as an inviolable strike.  The fact that it
happened in a small number of rounds where teams had not been eliminated
indicates a problem with unconstrained ratings.  The fact that it
occurred for a small number of teams after they were eliminated
illustrates a different issue.  Both deserve comment.  

At Wake, four teams each received a zero in a round in which they had
not been eliminated.  In each of the four cases, the team chose to rank
50% or more of the judges in the pool as 0.  In the most extreme case, a
team at Wake ranked 72.3% of all judges in the pool as 0.  Now clearly a
team can't expect over 50% strikes.  No algorithm has yet "guaranteed"
that 100% of judges will be in the top 50% of the distribution (even
though it frequently happens for a number of teams).  Needless to say,
even after adopting this bold strategy, a team complained that I hadn't
honored what should have been interpreted as a strike.  Even though I
could "explain" why the algorithm did what it did, where's the magic
line that Will and others would expect the program to honor?  How many
0's can I get away with (in the spirit of full disclosure, the team that
ranked 72% as 0's got lucky and didn't get any).  Will is probably right
that we need (regardless of system) to create a community consensus as
to where the "strike" line is and then honor it - both by not assigning
judges coded as such AND by not permitting teams to rank more
judges as strikes than appropriate.
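Such a bright line could be enforced mechanically when sheets are submitted. The sketch below is hypothetical, and the 10% cap is purely an illustrative number, not an established community standard.

```python
# Hypothetical enforcement of a "strike line": reject any preference sheet
# that rates more than a fixed share of the pool as 0.  The 10% cap is an
# illustrative choice, not an established norm.

def validate_sheet(ratings, strike_cap=0.10):
    """Return True if the sheet's share of 0's is within the cap;
    otherwise raise an error the team can act on before the tournament."""
    share = sum(1 for r in ratings if r == 0) / len(ratings)
    if share > strike_cap:
        raise ValueError(
            f"sheet strikes {share:.0%} of the pool; cap is {strike_cap:.0%}")
    return True

# A sheet like the Wake extreme (115 of 159 judges, ~72%, rated 0) would
# be rejected at submission rather than surprising the team in round 8:
extreme_sheet = [0] * 115 + [50] * 44
```

Rejecting the sheet up front keeps both halves of the consensus honest: teams can't claim more strikes than the community allows, and the tab room can then treat every remaining 0 as inviolable.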

The second scenario was in my judgment more problematic.  At both Wake
and Kentucky a small number of teams received 0's after having been
eliminated.  In almost all cases they had rated significantly more
judges as 0's than the roughly 11% that typically get coded as 9's but
in no case did a team rate more than 35% of the judges as 0's.  Did such
teams deserve such an outcome?  Again, without a bright line it is very
difficult to say.  But Will is right that the algorithm didn't
"guarantee" that they wouldn't receive 0's for precisely the same reason
that it couldn't/shouldn't guarantee that a team that rated over 70% as
0's not get such a judge.  If this line of experimentation is to
continue, we definitely need to define a "bright line" at the bottom.

Will's critique reflects another important discussion that needs to
continue regardless of the system employed.  As expectations increase,
it is increasingly difficult to give mid-range judge assignments without
complaint.  In the 9-category system with 11% in each category, it has
become the case that folks consider 3's mediocre, 4's really
marginal, 5's borderline unacceptable, and everything lower
absolutely unacceptable.  With the ratings system, since the ratings
themselves are variable from team to team, we did our diagnosis of
outcomes using ordinal equivalents.  But this ups the ante even further.
 At Wake, an ordinal ranking of 45 "seemed" to be quite bad even though
it would translate into a 3 in a 9-category system (159 total judges). 
A 60 seemed disastrous even though it's a high 4.  As a result,
particularly at Wake, the preferences in rounds 1-7 were significantly
better than usual (better than Kentucky, better than previous Wake
tournaments).  But this came at significant cost that Ross couldn't
accommodate in round 8.  So Will was right, prefs in round 8 were below
normal.  While I could argue that it's the aggregate result that matters
(for which Wake exceeded all expectations), round 8 is a crucial last
taste in the mouth both for those in break rounds AND those already
eliminated.  If we had it to do over, prefs would indeed have been lower
in preceding rounds to preserve a higher pref pool for the last round. 
As a side note, this problem was exacerbated by the previous issue of
not constraining the number of 0's.  Because of the lack of constraints,
MANY more 0's and otherwise low ranks were given at Wake than normal (or
even at Kentucky).  As a result 14 judges (58 rounds of obligation) had
average ratings LESS than 20.
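For reference, the ordinal-to-category translation used in the diagnosis above is simple arithmetic; the function below is my own illustration of it, not the diagnostic code itself.

```python
# Illustrative arithmetic for the ordinal-to-category diagnosis above:
# with 159 judges and nine equal ~11% categories, a 1-based ordinal rank
# maps to a category according to which ninth of the pool it falls in.
import math

def ordinal_to_category(rank, pool_size=159, categories=9):
    """Map a 1-based ordinal rank to its equal-quota category (1 = best)."""
    return math.ceil(rank * categories / pool_size)

rank_45 = ordinal_to_category(45)  # the "quite bad"-seeming 45 is a 3
rank_60 = ordinal_to_category(60)  # the "disastrous"-seeming 60 is a 4
```

The arithmetic makes the inflation of expectations visible: a rank that sounds alarming as an ordinal number sits comfortably in the middle categories once translated back.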

The short answer is that the system used at Kentucky and Wake probably
could not/should not be used at future tournaments without modification.
 But before we get to that discussion we should look at the data.
