[eDebate] Evolution of mutual pref - part 2 (an experiment)

Gary Larson Gary.N.Larson
Mon Apr 17 10:59:54 CDT 2006


The difference between using a modified ordinal system such as you
suggest and using a "grading" or "rating" system is the following.

Using ordinals while permitting ties and gaps suggests the following:

Assuming that the top four judges are tied, you could give them all
1's, all 4's, or all 2.5's (or perhaps any number between 1 and 4).
But the issue is what rank the next judge receives.  If you say that
it's a 5, you're still in the realm of an ordinal solution.  But if
you let judges 1-4 be 1's and then let judge 5 be a 3, with no rule
that the values need to correlate with rank order, you've moved into a
"rating" system that really isn't ordinal at all (in the
statistician's sense of that word).

But your point might be that we need to have as many possible grades
as we have judges so that someone could be fully ordinal if they
wanted to be.  There are a couple of reasons I wouldn't go in this
direction.  First, we can permit teams to avoid all ties by not
requiring whole-number values for the grades.  Unlike speaker points,
where we have good reasons not to let people give 28.7 points to a
debater, we need have no such constraint on judge ratings,
particularly if we're going to statistically normalize them for the
purpose of measuring mutuality.  Second, I would like teams to be able
to develop judge ratings that can be tournament-independent if the
team wants them to be.  A 93 rating can be a 93 whether there are 65
judges in the pool or 180.  Teams could either manage their own
database of judge ratings or it could even be centrally managed by
Bruschke (assuming that we had adequate data security).
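
Concretely, the normalization step could look like the following
sketch.  The team names, the judge labels, the raw ratings, and the
use of a population standard deviation are all illustrative
assumptions, not the actual tab software:

```python
# Minimal sketch: two teams rate the same judges on whatever scale
# they like; z-scoring each team's ratings against its own mean and
# spread makes the two distributions commensurable.
from statistics import mean, pstdev

def z_scores(ratings):
    """Map each judge's raw rating to standard-deviation units
    relative to this team's own distribution of ratings."""
    m = mean(ratings.values())
    s = pstdev(ratings.values())
    return {judge: (r - m) / s for judge, r in ratings.items()}

# Team 1 grades generously on a 1-100 scale; Team 2 grades harshly.
team1 = {"A": 93, "B": 90, "C": 70, "D": 50}
team2 = {"A": 60, "B": 55, "C": 30, "D": 10}

z1, z2 = z_scores(team1), z_scores(team2)

# The mutuality difference for a candidate judge is the gap between
# the two standardized ratings.  Judge "A" sits near the top of BOTH
# distributions, so the gap is small despite the 33-point raw spread.
for judge in team1:
    print(judge, round(z1[judge], 2), round(z2[judge], 2),
          round(abs(z1[judge] - z2[judge]), 2))
```

The point of the sketch is that a "93" from a generous team and a
"60" from a harsh one can land in the same place once each team's
distribution is standardized.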

>>> Michael Kloster <kloster at mynamesmike.com> 4/16/2006 6:27:52 pm >>>
Your suggestion of using grading preference makes sense.

I have a question.  Does it make sense to use a grading scale equal to
the number of judges?  Filling out preference sheets would be like
using ordinal ranking with the ability to leave gaps and double up.

Example:

Current ordinal ranking of

1 Judge A
2 Judge B
3 Judge C

Could become:

1 Judge A
3 Judge B
3 Judge C

or could become:

1 Judge A
1 Judge B
3 Judge C


etc.

Michael Kloster



Gary Larson wrote:
> As I concluded my last post, I noted that the third critique of
> mutual preference is a concern that as currently practiced, it might
> not treat each of the teams that fill out the sheets equally and
> fairly.
> 
> Each team in the tournament has a different valuation of each of the
> judges in the tournament.  Some they know and like.  Some they know
> and don't like.  Some they are neutral about.  Some they are afraid
> of.  Many they don't really know at all except perhaps by a
> reputation that might be unwarranted.  Some teams believe that they
> could, if required, provide an ordinal ranking of the judges from 1
> to xxx.  Others believe that they can only place them in a few broad
> categories with differing numbers of judges in each category and
> perhaps very different senses of how large the differences are
> between each judge within and between the various categories.  To
> this real but wildly variable intuition about judge preference, we
> necessarily impose structure so that we can provide for "mutuality"
> or "commensurability."
> 
> In all (or nearly all) tournaments with mutual preference, this
> imposed structure comes in the form of "categories."  We require
> teams to sort judges into 3 or 4 or 6 or 9 discrete categories where
> all judges within the category are treated as equal to each other
> and each category is treated as different from the next in absolute
> letter-grade or numerical terms.  Unfortunately, no matter how many
> or how few categories we have and how big or how small we make them,
> it is almost NEVER the case that for any team, their natural
> clustering of judges into preference groups matches the category
> boundaries that we impose.  Particularly when we had a standard ABCX
> practice of something like 40-30-20-10 or 50-20-20-10, the A
> category only accidentally matched the list of critics that any team
> would argue is an A for them.  Particularly when one team would
> claim that they only knew 20% of the critics while their opponents
> knew 50% or more of the critics, it could be argued that an AA match
> is not mutual at all.  Over time, as the pairing algorithms have
> improved, the sense of how large a category needs to be to be
> computationally manageable has decreased.  But it still results in
> necessary trade-offs.  What happens when the categories are
> 25-25-25-25 as opposed to 40-30-20-10?  Clearly it is not possible
> to have as many AA matches in the former as in the latter, but do
> actual preference and mutuality increase or decrease?  How about
> when we move from 4 to 6 categories as the NDT has done, or 9 as
> CEDA has done?  On one level, tournament experience has argued that
> both preference and mutuality can improve (within limits) as the
> number of categories increases and as the categories become smaller.
> The tests that Rich Edwards and John Fritch intend to do with the
> ordinal data that is being collected will go a long way toward
> demonstrating where the actual trade-offs and optimum limits appear.
> 
> But "categories" present inherent problems regardless of size and
> number.  It is still almost never the case, no matter how many or
> how small the categories, that they will match the intuitions of any
> of the teams in the tournament.  When I tell every team in the
> tournament that they will all have 20 or 40 or 60 1's (A+'s), that
> category membership will never mean the same thing for every team.
> Categories also create myths about mutuality.  We assume that a 1-1
> is a mutual match while a 1-2 isn't.  But if we assume 6 categories
> (20-20-15-15-15-15) for the sake of argument, a 1-1 match might pair
> one team's 1st-ranked judge with the other's 20th, while a 1-2 match
> might pair the 20th with the 21st (of course it might also pair the
> 1st with the 40th).  And one of the teams might have judges in their
> 1 group that they wish were 2's while the other team has judges in
> their 2 group that they would have been satisfied to count as 1's.
> So who is it an off-match for?  As the number of categories
> increases, actual mutuality generally increases but the perception
> of mutuality might decrease.  When the NDT split the A group into
> two halves, A+ and A, one of the outcomes was that there were more
> A+A matches.  In some minds this creates the impression of decreased
> mutuality (more off-matches).  But if you start with AA matches that
> now include A+A+, AA and A+A, you've clearly increased mutuality
> rather than decreased it.  A similar argument happens at CEDA Nats.
> While we try very hard to avoid it, there are some 1-3 matches.
> This creates the perception of a clearly unacceptable lack of
> mutuality even though in percentage terms even a 1-4 match would
> have been a mutual AA in old NDT parlance and an A+A (in almost
> every case) at present.
> 
> At CEDA a few years back we tested a solution to the problems
> created by categories by using ordinal rankings.  Everyone ranked
> the 165 judges 1-xxx (a few used ties but most struggled through the
> process of creating unique ordinal rankings for each judge).  On one
> level, the experiment was a success.  In the aggregate, teams
> received significantly higher preference and higher mutuality than
> they would have received if the judges had been assigned to the four
> categories that the tournament used the previous year or the nine
> categories that the tournament has used since.  But participant
> satisfaction was not as high for many at the tournament.  Most
> disliked the ranking task, and realistically NOBODY has reliable and
> valid judgments that would rank judges from 1-165.  Additionally,
> folks lost the satisfaction of getting a 1 (or an A).  Even though
> an ordinal 14 when your opponent got an 11 is more mutual and higher
> preference than the majority of AA or 1-1 placements, it doesn't
> "sound" or "feel" like it is.
> 
> More critically, ordinal preferences impose a different kind of
> structure on the intuitive preferencing of judges that might be just
> as suspect as the placement of judges into arbitrarily defined
> categories of specified sizes.  It assumes that the difference in
> preference between any two critics is equal in magnitude to the
> ordinal difference in their ranking.  This clearly violates the
> clustering that intuitively defines most of our judgments.  It also
> imposes a linearity on the entire population of judges when they
> most probably fall into a bell-curve (normal) distribution.  It
> doesn't respect the natural breaks within the distribution, nor does
> it recognize that the distance between the 10th and 15th judge in my
> ranking might be MUCH greater than the distance between my 95th and
> 100th out of 180 (a true observation for which I am indebted to John
> Fritch).
> 
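
The point in the quoted paragraph above, that ordinal ranks erase
unequal distances between judges, can be seen in a tiny sketch (judge
labels and ratings are invented for illustration):

```python
# Converting ratings to ordinal ranks discards how far apart judges
# really are.  The ratings below are invented for the example.
ratings = {"J10": 88, "J15": 60, "J95": 41, "J100": 40}

# Rank 1 goes to the highest-rated judge.
ranked = sorted(ratings, key=ratings.get, reverse=True)
ranks = {judge: i + 1 for i, judge in enumerate(ranked)}

# A 28-point rating gap and a 1-point rating gap both collapse into
# the same one-step ordinal difference.
gap_top = ratings["J10"] - ratings["J15"]      # 28 points, ranks 1 vs 2
gap_bottom = ratings["J95"] - ratings["J100"]  # 1 point, ranks 3 vs 4
print(ranks)
print(gap_top, gap_bottom)
```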
> Hence an experiment:
> 
> I played with a system awhile back that permitted everyone to grade
> judges just like they grade other performances.  Instead of putting
> judges into pre-defined norm-referenced groups (essentially curving
> judges by saying that x% get A's ...) we could let teams put judges
> into criterion-referenced groups.  We could permit teams to assign
> judges a score of 1-100 (or 1-30 or whatever) where there were no
> external mandates on the distribution of scores that any team gave
> to the judges.  Of course, the problem of incommensurability
> immediately becomes evident.  How do we define mutuality when
> everyone has a different distribution?  One of the advantages of
> categories is that they impose a kind of commensurability even if it
> is an illusion.  And what would prevent a team from rating ALL of
> the judges low, not because they wanted mutuality to affect the
> assignment of the judges they get but rather because they want to
> adversely affect the judges their opponent gets?  If I think that I
> like 15% of the judges and my opponent likes 65% of the judges, I
> might despair of meeting my opponent with one of my 15%.  So I might
> race to the bottom, hoping that I can keep my opponent from getting
> any of their 65%.  Even if I discount the "perversity" of this
> possibility, how do I ever define mutuality other than to pray that
> everyone has basically the same strategy for assigning ratings (sort
> of like speaker points, isn't it)?
> 
> Oddly enough, the technological solution that works like speaker
> points is readily available (and quite defensible) through the use
> of z-score normalizations.  If we were willing to forgo some of the
> transparency of current categorization schemes, we could achieve the
> simplest, most elegant, and arguably fairest way of providing true
> mutuality and preference by doing the following:
> 
> Let every team assign ratings to every judge in the pool using
> whatever distributional logic makes sense to them.  It could have
> many (or no) ties, it could have a high mean or a low mean ...
> Additionally, teams wouldn't have to redo their ratings for
> different tournaments (unless their opinion of the judge changed).
> Teams wouldn't have to count slots, and tournaments wouldn't have to
> validate them.  Then when the tournament uploaded pref data for the
> judges who ultimately appear in the pool (with much less concern
> about judges entering or leaving the pool at the last minute, as
> long as ratings for all judges potentially in the pool exist), the
> software would normalize each team's distribution of ratings using
> the z-score transform.  The mutual preference task is then to
> maximize ratings while minimizing the mutuality difference as
> represented by the z-score.
> 
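
As a concrete illustration of that objective (maximize ratings while
minimizing the z-score gap), a scoring function for candidate judges
might look like the following.  The function, its inputs, and the
weight on the mutuality penalty are hypothetical assumptions for
illustration, not the actual pairing software:

```python
# Hypothetical scoring of candidate judges for one debate: prefer
# judges both teams rate highly, penalize z-score (mutuality) gaps.
# The 1.0 default weight on the gap is an illustrative assumption.

def judge_score(z_aff, z_neg, mutuality_weight=1.0):
    """Higher is better: the sum of the two teams' standardized
    ratings, minus a penalty for how far apart those ratings are."""
    preference = z_aff + z_neg
    mutuality_gap = abs(z_aff - z_neg)
    return preference - mutuality_weight * mutuality_gap

# Judge X: both teams place them near the top of their distributions.
# Judge Y: same total preference (2.3), but a big mutuality gap.
x = judge_score(1.2, 1.1)
y = judge_score(1.9, 0.4)
print(x, y)  # X scores higher: equal preference, far better mutuality
```

An actual tab program would evaluate something like this over every
available judge in every debate, but the trade-off it encodes is the
one described above.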
> On one level, this is exactly what we wish to do with our intuition
> of mutuality.  We desire that within the distribution of judges for
> the two teams, the assigned judge occupies the same place.
> 
> Perhaps the biggest disadvantage is that it might seem like voodoo,
> placing even more power into the black box (just like high/low
> bracket pairings are no longer very inspectable unless one has all
> the data).  I would be glad to answer questions as to how it would
> work and how it would treat the various anomalous distributions that
> teams might create.  But based on simulations, this solution offers
> much greater ease of use for the end users while more closely
> matching their actual intuitions about the judging pool.
> 
> I would propose that we discuss it and then that one or more
> tournaments volunteer to test it as an alternative.
> 
> GARY
> 
> _______________________________________________
> eDebate mailing list
> eDebate at ndtceda.com 
> http://www.ndtceda.com/mailman/listinfo/edebate 
> 
> 




