[eDebate] Evolution of mutual pref - part 2 (an experiment)

Gary Larson Gary.N.Larson
Mon Apr 17 10:59:54 CDT 2006

The difference between using a modified ordinal system such as you
suggest and using a "grading" or "rating" system is the following.

Using ordinals while permitting ties and gaps suggests the following:

Assuming that the top four judges are tied, you could give them all
1's, all 4's, all 2.5's (or perhaps any number between 1 and 4).  But
the issue is what the next judge ranking is.  If you say that it's a 5,
you're still in the realm of an ordinal solution.  But if you let 1-4 be
1's and then let 5 be a 3 and never have any rule that they need to
correlate, you've moved into a "rating" system that really isn't ordinal
at all (in the statistician's sense of that word).
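
To make the three tie conventions concrete, here is a minimal Python
sketch.  The function name tied_ranks and the sample ratings are
illustrative assumptions, not part of any tournament software.  It
converts ratings (lower = more preferred) into ordinal ranks under each
convention.  With four tied judges, the next judge lands at 5 in every
case, which is what keeps the result ordinal.

```python
def tied_ranks(ratings, method="average"):
    """Convert ratings (lower = more preferred) into ordinal ranks.

    Tied judges share one rank: the lowest ("min"), highest ("max"),
    or mean ("average") of the positions the tied group occupies.
    """
    order = sorted(range(len(ratings)), key=lambda i: ratings[i])
    ranks = [0.0] * len(ratings)
    start = 0
    while start < len(order):
        # Extend the group while the next judge has the same rating.
        end = start
        while (end + 1 < len(order)
               and ratings[order[end + 1]] == ratings[order[start]]):
            end += 1
        first, last = start + 1, end + 1  # 1-based positions of the group
        if method == "min":
            shared = float(first)
        elif method == "max":
            shared = float(last)
        elif method == "average":
            shared = (first + last) / 2
        else:
            raise ValueError("method must be 'min', 'max', or 'average'")
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


# The "rating" version from the text: judges 1-4 get a 1, judge 5 gets a 3.
ratings = [1, 1, 1, 1, 3]
for method in ("min", "max", "average"):
    print(method, tied_ranks(ratings, method))
```

Run on the ratings 1, 1, 1, 1, 3, the three conventions give the tied
judges 1, 4, or 2.5 respectively, and the fifth judge is a 5 each time.
The raw rating of 3 carries no such constraint, which is the sense in
which it is a rating rather than an ordinal.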

But your point might be that we need to have as many possible grades as
we have judges so that someone could be fully ordinal if they wanted to
be.  There are a couple of reasons I wouldn't go in this direction.  First,
we can permit teams to avoid all ties by not requiring whole-number
values for the grades.  Unlike speaker points where we have good reasons
to not let people give 28.7 points to a debater, we need have no such
constraint on judge ratings, particularly if we're going to
statistically normalize them for the purpose of measuring mutuality. 
Second, I would like teams to be able to develop judge ratings that can
be tournament independent if the team wants them to be.  A 93 rating can
be a 93 whether there are 65 judges in the pool or 180.  Teams could
either manage their own database of judge ratings or it could even be
centrally managed by Bruschke (assuming that we had adequate data

>>> Michael Kloster <kloster at mynamesmike.com> 4/16/2006 6:27:52 pm >>>
Your suggestion of using grading preference makes sense.

I have a question. Does it make sense to use a grading scale equal to
the number of judges? Filling out preference sheets would be like
ordinal ranking with the ability to leave gaps and double up.


Current ordinal ranking of

1 Judge A
2 Judge B
3 Judge C

Could become:

1 Judge A
3 Judge B
3 Judge C

or could become:

1 Judge A
1 Judge B
3 Judge C


Michael Kloster

Gary Larson wrote:
> As I concluded my last post, I noted that the third critique of
> preference is a concern that as currently practiced, it might not
> treat each of the teams that fill out the sheets equally and fairly.
> Each team in the tournament has a different valuation of each of the
> judges in the tournament.  Some they know and like.  Some they know
> and don't like.  Some they are neutral about.  Some they are afraid
> of.  Many they don't really know at all except perhaps by a
> reputation that might be unwarranted.  Some teams believe that they
> could, if asked, provide an ordinal ranking of the judges from 1 to
> xxx.  Others feel that they can only place them in a few broad
> categories with varying numbers of judges in each category and
> perhaps very different senses of how large the differences are
> between each judge within and between the various categories.  To
> this real but wildly variable intuition about judge preference, we
> necessarily impose structure so that we can [...] for "mutuality" or
> "commensurability."
> In all (or nearly all) tournaments with mutual preference, this
> structure comes in the form of "categories."  We require teams to
> place judges into 3 or 4 or 6 or 9 discrete categories where all
> judges in the category are treated as equal to each other and each
> category is treated as different from the next in absolute
> letter-grade or numeric terms.  Unfortunately, no matter how many or
> how few categories we use and how big or how small we make them, it
> is almost NEVER the case that for any team, their natural clustering
> of judges into preference groups matches the category boundaries that
> we impose.  Particularly when we had a standard ABCX practice of
> something like 40-30-20-10 or 50-20-20-10, the A category only
> accidentally matched the list of critics that any team would argue is
> an A for them.  Particularly when one team would claim that they only
> knew 20% of the critics while their opponents knew 50% or more of the
> critics, it could be argued that an AA match is not mutual at all.
> Over time, as the pairing algorithms have improved, the sense of how
> large a category needs to be to be computationally manageable has
> decreased.  But it still results in necessary trade-offs.  What
> happens when the categories are [...] as opposed to 40-30-20-10?
> Clearly it is not possible to have as many AA matches in the former
> as in the latter, but do actual preference and mutuality increase or
> decrease?  How about when we move from 4 to 6 categories as the NDT
> has done or 9 as CEDA has done?  On one level, tournament experience
> has argued that both preference and mutuality improve (within limits)
> as the number of categories increases and as the categories become
> smaller.  The tests that Rich Edwards and John [...] intend to do
> with the ordinal data that is being collected will go a long way
> toward demonstrating where the actual trade-offs and [...] limits
> appear.
> But "categories" present inherent problems regardless of size and
> number.  It is still almost never the case, no matter how many or how
> small the categories, that they will match the intuitions of any of
> the teams in the tournament.  When I tell every team in the
> tournament that they will all have 20 or 40 or 60 1's (A+'s), that
> category will never mean the same thing for every team.  Categories
> also create myths about mutuality.  We assume that a 11 is a mutual
> match while a 12 isn't.  But if we assume 6 categories
> (20-20-15-15-15-15) for sake of argument, a 11 match might be 1-20
> while a 12 match might be 20-21 (of course it might also be 1-40).
> And one of the teams might have judges in their 1 group that they
> wish were 2's while the other team has judges in their 2 group that
> they would have been satisfied if they had to [...] 1's.  So who is
> it an off-match for?  As the number of categories increases, actual
> mutuality generally increases but the perception of mutuality might
> decrease.  When the NDT split the A group into two halves, A+ and A,
> one of the outcomes was that there were more A+A matches.  In some
> minds this creates the impression of decreased mutuality (more
> off-matches).  But if you start with AA matches that include A+A+, AA
> and A+A, you've clearly increased mutuality rather than decreased it.
> A similar argument happens at CEDA Nats.  While we try very hard to
> avoid it, there are some 13 matches.  This creates the perception of
> a clearly unacceptable lack of mutuality even though in percentage
> terms even a 14 match would have been a mutual AA in old parlance and
> an A+A (in almost every case) at present.
> At CEDA a few years back we tested a solution to the problems posed
> by categories by using ordinal rankings.  Everyone ranked the 165
> judges 1-xxx (a few used ties but most struggled through the process
> of creating unique ordinal rankings for each judge).  On one level,
> the experiment was a success.  In the aggregate, teams received
> higher preference and higher mutuality than they would have received
> if the judges had been assigned to the four categories that the
> tournament used the previous year or the nine categories that the
> tournament has used since.  But participant satisfaction was not as
> high for many in the tournament.  Most disliked the ranking task and
> realistically no one has reliable and valid judgments that would rank
> judges from 1-165.  Additionally, folks lost the satisfaction of
> getting a 1 (or an A).  Even though an ordinal 14 when your opponent
> got an 11 is more mutual and higher preference than the majority of
> AA or 11 placements, it doesn't "sound" or "feel" like it is.
> More critically, ordinal preferences impose a different kind of
> structure on the intuitive preferencing of judges that might be just
> as suspect as the placement of judges into arbitrarily defined
> categories of specified sizes.  It assumes that the difference in
> preference between any two critics is equal in magnitude to the
> ordinal difference in their ranking.  This clearly violates the
> clustering that defines most of our judgments.  It also imposes a
> linearity on the entire population of judges when they most probably
> fall into a bell-curve (normal) distribution.  It doesn't respect the
> natural clustering within the distribution nor does it recognize that
> the distance between the 10th and 15th judge in my ranking might be
> MUCH greater than the distance between my 95th and 100th out of 180
> (a true observation which I am indebted to John Fritch for pointing
> out).
> Hence an experiment:
> I played with a system awhile back that permitted everyone to grade
> judges just like they grade other performances.  Instead of putting
> judges into pre-defined norm-referenced groups (essentially curving
> the judges by saying that x% get A's ...) we could let teams put
> judges into criteria-referenced groups.  We could permit teams to
> assign judges a score of 1-100 (or 1-30 or whatever) where there were
> no external mandates on the distribution of scores that any team gave
> to the judges.
> Of course, the problem of incommensurability immediately becomes
> evident.  How do we define mutuality when everyone has a different
> distribution?  One of the advantages of categories is that they
> impose a kind of commensurability even if it is an illusion.  And
> what would prevent a team from rating ALL of the judges low, not
> because they wanted mutuality to affect the assignment of the judges
> they get but rather because they want to adversely affect the judges
> their opponent gets?  If I think that I like 15% of the judges and my
> opponent likes 65% of the judges, I might despair of meeting my
> opponent with one of my 15%.  So I might race to the bottom, hoping
> that I can keep my opponent from getting any of their 65%.  Even if I
> discount the "perversity" of this possibility, how do I ever define
> mutuality other than pray that everyone has basically the same
> strategy for assigning ratings (sort of like speaker points, isn't
> it)?
> Oddly enough, the technological solution that works like speaker
> points is readily available (and quite defensible) through the use of
> normalizations.  If we were willing to forgo some of the transparency
> of current categorization schemes, we could achieve arguably the most
> elegant and arguably most fair way of providing true mutuality and
> preference by doing the following:
> Let every team assign ratings to every judge in the pool using
> whatever distributional logic makes sense to them.  It could have
> many (or no) ties, it could have a high mean or a low mean ...
> Additionally, teams wouldn't have to redo their ratings for different
> tournaments (unless their opinion of the judge changed).  Teams
> wouldn't have to count slots, tournaments wouldn't have to validate
> them.  Then when the tournament uploaded pref data for the judges who
> ultimately appear in the pool (with much less concern about judges
> entering or leaving the pool at the last minute as long as ratings
> for all judges potentially in the pool exist), the software would
> normalize each team's distribution of ratings using the z-score
> transform.  The mutual preference task is then to maximize ratings
> while minimizing the mutuality difference as represented by the
> z-score.
> On one level, this is exactly what we wish to do with our intuition
> of mutuality.  We desire that within the distribution of judges for
> the two teams, the assigned judge occupies the same place.
> Perhaps the biggest disadvantage is that it might seem like voodoo,
> placing even more power into the black box (just like high/low
> pairings are no longer very inspectable unless one has all the data).
> I would be glad to answer questions as to how it would work and how
> it would treat the various anomalous distributions that teams might
> submit.  But based on simulations, this solution combines much
> greater ease of use for the end users while more closely matching
> their actual intuitions about the judging pool.
> I would propose that we discuss it and then that one or more
> tournaments volunteer to test it as an alternative.
> _______________________________________________
> eDebate mailing list
> eDebate at ndtceda.com 
> http://www.ndtceda.com/mailman/listinfo/edebate 
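
The normalization step proposed in the quoted message can be sketched
in a few lines of Python.  This is only an illustration under stated
assumptions: the names z_scores and pick_judge, the gap_weight
parameter, the sample ratings, and the use of the population standard
deviation are all hypothetical, and the scoring rule is just one
simple way to "maximize ratings while minimizing the mutuality
difference."  Higher ratings are taken to mean higher preference.

```python
from statistics import mean, pstdev


def z_scores(ratings):
    """Normalize one team's {judge: rating} map against that team's own
    mean and standard deviation (higher rating = more preferred)."""
    values = list(ratings.values())
    mu = mean(values)
    sigma = pstdev(values)  # assumption: population standard deviation
    if sigma == 0:
        # A team that rates every judge identically expresses no preference.
        return {judge: 0.0 for judge in ratings}
    return {judge: (r - mu) / sigma for judge, r in ratings.items()}


def pick_judge(team_a, team_b, gap_weight=1.0):
    """Hypothetical scoring rule: reward combined preference and penalize
    the mutuality gap |z_a - z_b|.  Only judges rated by both teams count."""
    za, zb = z_scores(team_a), z_scores(team_b)
    shared = sorted(set(za) & set(zb))

    def score(judge):
        return za[judge] + zb[judge] - gap_weight * abs(za[judge] - zb[judge])

    return max(shared, key=score)


# One team grades generously, the other harshly, with the same ordering.
generous = {"Judge A": 95, "Judge B": 85, "Judge C": 75}
harsh = {"Judge A": 60, "Judge B": 40, "Judge C": 20}

print(z_scores(generous))
print(z_scores(harsh))
print(pick_judge(generous, harsh))
```

In this example the two teams' z-scores coincide (up to floating-point
rounding) even though their raw ratings differ widely.  Subtracting a
constant from every rating leaves a team's z-scores unchanged, so in
this sketch a team cannot move the comparison by uniformly rating
every judge low.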
