[eDebate] Evolution of mutual pref - part 2 (an experiment)

Gary Larson Gary.N.Larson
Sun Apr 16 15:49:35 CDT 2006

As I concluded my last post, I noted that the third critique of mutual
preference is a concern that as currently practiced, it might not treat
each of the teams that fill out the sheets equally and fairly.

Each team in the tournament has a different valuation of each of the
judges in the tournament.  Some they know and like.  Some they know and
don't like.  Some they are neutral about.  Some they are afraid of. 
Many they don't really know at all except perhaps by a reputation that
might be unwarranted.  Some teams believe that they could, if required,
provide an ordinal ranking of the judges from 1 to xxx.  Others believe
that they can only place them in a few broad categories with differing
numbers of judges in each category and perhaps very different senses of
how large the differences are between each judge within and between the
various categories.  To this real but wildly variable intuition about
judge preference, we necessarily impose structure so that we can provide
for "mutuality" or "commensurability."

In all (or nearly all) tournaments with mutual preference, this imposed
structure comes in the form of "categories."  We require teams to sort
judges into 3 or 4 or 6 or 9 discrete categories where all judges within
the category are treated as equal to each other and each category is
treated as different from the next in absolute letter-grade or numerical
terms.  Unfortunately, no matter how many or how few categories we have
and how big or how small we make them, it is almost NEVER the case that
for any team, their natural clustering of judges into preference groups
matches the category boundaries that we impose.  Particularly when we
had a standard ABCX practice of something like 40-30-20-10 or
50-20-20-10, the A category only accidentally matched the list of
critics that any team would argue is an A for them.  Particularly when
one team would claim that they only knew 20% of the critics while their
opponents knew 50% or more of the critics, it could be argued that an AA
match is not mutual at all.  Over time, as the pairing algorithms have
improved, the sense of how large a category needs to be to be
computationally manageable has decreased.  But it still results in
necessary trade-offs.  What happens when the categories are 25-25-25-25
as opposed to 40-30-20-10?  Clearly it is not possible to have as many
AA matches in the former as in the latter, but do actual preference and
mutuality increase or decrease?  How about when we move from 4 to 6
categories as the NDT has done or 9 as CEDA has done?  On one level,
tournament experience has argued that both preference and mutuality can
improve (within limits) as the number of categories increase and as the
categories become smaller.  The tests that Rich Edwards and John Fritch
intend to do with the ordinal data that is being collected will go a
long way toward demonstrating where the actual trade-offs and optimum
limits appear.

But "categories" present inherent problems regardless of size and
number.  It is still almost never the case no matter how many or how
small the categories that they will match the intuitions of any of the
teams in the tournament.  When I tell every team in the tournament that
they will all have 20 or 40 or 60 1's (A+'s), that category membership
will never mean the same thing for every team.  Categories also create
myths about mutuality.  We assume that a 1-1 is a mutual match while a
1-2 isn't.  But if we assume 6 categories (20-20-15-15-15-15) for the
sake of argument, a 1-1 match might pair judges at ordinal positions 1
and 20 while a 1-2 match might pair positions 20 and 21 (of course it
might also pair positions 1 and 40).  And one of the teams might have judges
in their 1 group that they wish were 2's while the other team has judges
in their 2 group that they would have been satisfied if they had to be
1's.  So who is it an off-match for?  As the number of categories
increases actual mutuality generally increases but the perception of
mutuality might decrease.  When the NDT split the A group into two
halves A+ and A, one of the outcomes was that there were more A+A
matches.  In some minds this creates the impression of decreased
mutuality (more off-matches).  But if you start with AA matches that now
include A+A+, AA and A+A you've clearly increased mutuality rather than
decreased it.  A similar argument happens at CEDA Nats.  While we try
very hard to avoid it, there are some 1-3 matches.  This creates the
perception of a clearly unacceptable lack of mutuality even though in
percentage terms even a 1-4 match would have been a mutual AA in old NDT
parlance and an A+A (in almost every case) at present.
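To make the arithmetic above concrete, here is a small sketch (the
category sizes come from the 20-20-15-15-15-15 example in this post;
the helper function is my own illustration) that maps category sizes
onto the span of ordinal positions each category covers:

```python
def category_bounds(sizes):
    """Map category sizes to (first, last) ordinal positions, 1-indexed."""
    bounds, start = [], 1
    for size in sizes:
        bounds.append((start, start + size - 1))
        start += size
    return bounds

# Six categories of sizes 20-20-15-15-15-15, as in the example above.
bounds = category_bounds([20, 20, 15, 15, 15, 15])
print(bounds[0])  # (1, 20): a "mutual" 1-1 match can pair ordinals 1 and 20
print(bounds[1])  # (21, 40): an "off" 1-2 match can pair ordinals 20 and 21
```

The point the sketch makes is just the one in the paragraph: a 1-1
match guarantees only that both teams placed the judge somewhere in a
20-judge span, while a 1-2 "off-match" can sit a single ordinal
position apart.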

At CEDA a few years back we tested a solution to the problems created
by categories by using ordinal rankings.  Everyone ranked the 165 judges
1-165 (a few used ties but most struggled through the process of
creating unique ordinal rankings for each judge).  On one level, the
experiment was a success.  In the aggregate teams received significantly
higher preference and higher mutuality than they would have received if
the judges had been assigned to the four categories that the tournament
used the previous year or the nine categories that the tournament has
used since.  But participant satisfaction was not as high for many at
the tournament.  Most disliked the ranking task and realistically NOBODY
has reliable and valid judgments that would rank judges from 1-165. 
Additionally, folks lost the satisfaction of getting a 1 (or an A). 
Even though an ordinal 14 when your opponent got an 11 is more mutual
and higher preference than the majority of AA or 1-1 placements, it
doesn't "sound" or "feel" like it is.

More critically, ordinal preferences impose a different kind of
structure on the intuitive preferencing of judges that might be just as
suspect as the placement of judges into arbitrarily defined categories
of specified sizes.  It assumes that the difference in preference
between any two critics is equal in magnitude to the ordinal difference
in their ranking.  This clearly violates the clustering that intuitively
defines most of our judgments.  It also imposes a linearity on the
entire population of judges when they most probably fall into a
bell-curve (normal) distribution.  It doesn't respect the natural breaks
within the distribution nor does it recognize that the distance between
the 10th and 15th judge in my ranking might be MUCH greater than the
distance between my 95th and 100th out of 180 (a true observation for
which I am indebted to John Fritch for pointing out).
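A toy illustration of that point (the ratings below are invented for
the example): intuitive ratings that cluster into distinct groups
collapse into evenly spaced ordinal ranks, erasing any record of how
large the gaps between the clusters were.

```python
# Hypothetical "true" intuitive ratings for five judges, clustered into
# a top group (A, B, C) and a bottom group (D, E).
ratings = {"A": 95, "B": 94, "C": 93, "D": 60, "E": 59}

# Converting to ordinal ranks flattens the clusters: B->C and C->D are
# both "one rank apart" even though the real gaps are 1 and 33 points.
ordered = sorted(ratings, key=ratings.get, reverse=True)
ranks = {judge: i + 1 for i, judge in enumerate(ordered)}
print(ranks)  # {'A': 1, 'B': 2, 'C': 3, 'D': 4, 'E': 5}
```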

Hence an experiment:

I played with a system awhile back that permitted everyone to grade
judges just like they grade other performances.  Instead of putting
judges into pre-defined norm-referenced groups (essentially curving
judges by saying that x% get A's ...) we could let teams put judges into
criteria-referenced groups.  We could permit teams to assign judges a
score of 1-100 (or 1-30 or whatever) where there were no external
mandates on the distribution of scores that any team gave to the judges.
 Of course, the problem of incommensurability immediately becomes
evident.  How do we define mutuality when everyone has a different
distribution?  One of the advantages of categories is that they impose a
kind of commensurability even if it is an illusion.  And what would
prevent a team from rating ALL of the judges low, not because they
want mutuality to affect the assignment of the judges they get but
rather because they want to adversely affect the judges their opponent
gets?  If I think that I like 15% of the judges and my opponent likes
65% of the judges I might despair of meeting my opponent with one of my
15%.  So I might race to the bottom, hoping that I can keep my opponent
from getting any of their 65%.  Even if I discount the "perversity" of
this possibility, how do I ever define mutuality other than to pray that
everyone has basically the same strategy for assigning ratings (sort of
like speaker points, isn't it?).

Oddly enough, the technological solution that works like speaker points
is readily available (and quite defensible) through the use of z-score
normalizations.  If we were willing to forgo some of the transparency of
current categorization schemes we could achieve arguably the simplest,
most elegant and arguably most fair way of providing true mutuality and
preference by doing the following:

Let every team assign ratings to every judge in the pool using whatever
distributional logic that makes sense to them.  It could have many (or
no) ties, it could have a high mean or a low mean ...  Additionally,
teams wouldn't have to redo their ratings for different tournaments
(unless their opinion of the judge changed).  Teams wouldn't have to
count slots, tournaments wouldn't have to validate them.  Then when the
tournament uploaded pref data for the judges who ultimately appear in
the pool (with much less concern about judges entering or leaving the
pool at the last minute as long as ratings for all judges potentially in
the pool exist), the software would normalize each team's distribution
of ratings using the z-score transform.  The mutual preference task is
then to maximize ratings while minimizing the mutuality difference as
represented by the z-score.
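As a sketch of how that normalization might work (the judge names and
ratings below are invented, and the code is my own illustration using
only Python's standard library, not the actual tabulation software):

```python
import statistics

def z_normalize(ratings):
    """Convert one team's raw ratings (judge -> score) to z-scores."""
    mean = statistics.mean(ratings.values())
    stdev = statistics.pstdev(ratings.values())
    return {judge: (r - mean) / stdev for judge, r in ratings.items()}

# Hypothetical ratings on incommensurable scales: team A rates
# generously, team B rates everyone low.
team_a = {"Smith": 90, "Jones": 70, "Lee": 50, "Park": 30}
team_b = {"Smith": 40, "Jones": 10, "Lee": 30, "Park": 20}

za, zb = z_normalize(team_a), z_normalize(team_b)

# For each candidate judge, the mutuality gap is |z_A - z_B|; the
# pairing task is to maximize z_A + z_B while minimizing that gap.
# Smith's z-score is identical for both teams (about 1.34) even though
# the raw scores were 90 and 40: both teams put him at the same place
# in their own distributions.
for judge in team_a:
    print(judge, round(za[judge], 2), round(zb[judge], 2),
          round(abs(za[judge] - zb[judge]), 2))
```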

On one level, this is exactly what we wish to do with our intuition of
mutuality.  We desire that within the distribution of judges for the two
teams, the assigned judge occupies the same place.
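One consequence worth noting (my observation, illustrated with invented
numbers): because z-scores are unchanged by shifting a team's scale or
uniformly compressing it, the "race to the bottom" strategy worried
about earlier buys a team nothing once ratings are normalized.

```python
import statistics

def z_scores(values):
    """Z-score a list of raw ratings."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]

honest = [90, 70, 50, 30]
# The same team "racing to the bottom": every rating shifted down 40
# points and the scale compressed by half.
deflated = [(v - 40) * 0.5 for v in honest]

# The two sets of z-scores are identical, so deflating the scale does
# not change where any judge sits relative to the team's distribution.
print([round(z, 3) for z in z_scores(honest)])
print([round(z, 3) for z in z_scores(deflated)])
```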

Perhaps the biggest disadvantage is that it might seem like voodoo,
placing even more power into the black box (just like high/low bracket
pairings are no longer very inspectable unless one has all the data).  I
would be glad to answer questions as to how it would work and how it
would treat the various anomalous distributions that teams might create.
But based on simulations, this solution combines much greater ease of
use for the end users while more closely matching their actual intuitions
about the judging pool.

I would propose that we discuss it and then that one or more
tournaments volunteer to test it as an alternative.

