[eDebate] Mutual preference experiment - the data

Gary Larson Gary.N.Larson
Wed Nov 22 14:07:14 CST 2006

As I report the distribution of ratings assigned, it is important to
remember the caveat that the ratings represent either or both an attempt
to encode actual preference judgments for each of the critics rated OR
an attempt to create a distribution that has the goal of maximizing the
assignment of some critics while minimizing the assignment of other
critics.  Because of the latter, I can never confidently assert that the
judge ratings represent the former.  When a team rates 70% of the pool
as 0, I have no way of knowing whether any individual 0 was really a 0
equal to all others or whether they were a 50 rated as 0 for the hopeful
purpose of maximizing the chances of getting higher judges.

Given that caveat, what did the data show?

Overall average - 45.51
Overall stdev - 34.09

Hypothesis 1:  Does the clustering of the data for most teams fall into
categories as typically defined?

One of the relevant questions is "how many" different ratings did teams
use to encode their preferences.  If most teams only used 10 or fewer
categories, it would be easy to argue that there is no need for
additional ones.  For those who completed ratings, the number of
distinctions made by teams ranged from 2 to 100.  That's correct -
the minimum was 2.  One team chose to rank 42 judges as 100 and 115
judges as 0.  The mean number of distinctions made was 29 and the median
was 21.  Only 7 teams used fewer than 10 distinct ratings.  So there
would be a warrant to say that teams perceive that more than 9
categories continue to permit them to make meaningful distinctions
between judges.  But at the same time only 3 teams used all 101
available ratings so there may not be a warrant to argue that folks
would find it meaningful to encode ordinal rankings at large tournaments
(unless ties were permitted/encouraged).
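
As a minimal sketch of how the "number of distinctions" can be counted
(the team names, judge names and ratings below are made up for
illustration and are not the tournament data), each team's count is
simply the number of distinct values on its sheet:

```python
from statistics import mean, median

def distinct_rating_counts(sheets):
    """Map each team to the number of distinct rating values it used."""
    return {team: len(set(ratings.values())) for team, ratings in sheets.items()}

# Hypothetical toy sheets: team -> {judge: rating on the 0-100 scale}
sheets = {
    "team_a": {"j1": 100, "j2": 100, "j3": 0, "j4": 0},   # only two distinctions
    "team_b": {"j1": 95, "j2": 80, "j3": 40, "j4": 0},
    "team_c": {"j1": 90, "j2": 90, "j3": 55, "j4": 10},
}

counts = distinct_rating_counts(sheets)
fewest = min(counts.values())
most = max(counts.values())
average = mean(counts.values())
middle = median(counts.values())
print(counts, fewest, most, average, middle)
```

A team that rates every judge either 100 or 0, like team_a here, shows
up with a count of 2 no matter how many judges are in the pool.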

Beyond counting the number of distinctions that teams used, the next
question about how the data would/could fit into an appropriate
category scheme is whether the ratings clustered consistently.  In other
words, if two teams both used 15 distinct ratings, would there be a
consistency in how many judges they assigned to each rating?  Do the
natural breakpoints in their distributions fall at the same place?

One of the strongest findings regarding the data is that for any two
teams that encode the same number of distinctions, the distribution of
ratings within those "categories" is NOT consistent.  In other words,
while the imposition of an arbitrary number of categories doesn't seem
to violate the ranking strategy of most teams (as long as the number of
categories is sufficiently large - arguably 15-20 for a tournament with
159 judges), the imposition of quotas on each of the categories does not
appear to match the actual ratings provided by teams.  If the ratings
are taken at face value, that proves true whether the categories are
defined in advance as having the same size or arbitrarily defined
different sizes.  So the issue might be the category quotas rather than
the categories per se.  As a result, the definition of the quotas is an
issue that deserves further discussion if category systems are to be
used.
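
To make the categories-versus-quotas distinction concrete, here is a
small sketch with two invented sheets (not real teams).  Both use the
same number of distinct ratings, but the number of judges placed at each
rating differs, so no single quota scheme would fit both:

```python
from collections import Counter

def category_sizes(ratings):
    """Judges per distinct rating, listed from the highest rating down."""
    tally = Counter(ratings)
    return [tally[value] for value in sorted(tally, reverse=True)]

# Hypothetical sheets, each with three distinct ratings
team_x = [100, 100, 100, 100, 50, 0]
team_y = [100, 50, 50, 0, 0, 0]

sizes_x = category_sizes(team_x)
sizes_y = category_sizes(team_y)

same_number_of_categories = len(sizes_x) == len(sizes_y)
same_quotas = sizes_x == sizes_y
print(sizes_x, sizes_y, same_number_of_categories, same_quotas)
```

Here the category count matches while the implied quotas do not, which
is the pattern described above.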

Hypothesis 2:  Does the distribution of ratings fall into a "normal"
bell curve?

Once again with the caveat that ratings often represented a variety of
strategic choices, this hypothesis was strongly disconfirmed.  While it
was the case that a number of teams did create bell-shaped
distributions, a number of others adopted a more linear model - at times
attempting to simply encode a 9 equal-sized category coding into a 0-100
equivalent.  More notably, several teams did adopt bi-modal
distributions.  When balanced all together, the following represents the
distribution of all 21555 ratings that were assigned.

Rating   Count   Share   Combined share
100       1287   0.06
91-99      766   0.03    0.09 (91-100)
81-90     2065   0.10
71-80     2155   0.10
61-70     1707   0.08
51-60     1486   0.07
41-50     1742   0.08
31-40     1361   0.06
21-30     1661   0.08
11-20     1733   0.08
1-10      2113   0.10    0.26 (0-10)
0         3479   0.16
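
For anyone who wants to re-derive the shares, the following sketch
recomputes them from the counts in the table.  The counts are copied
from above; the variable names are mine.  The counts do sum to the
21555 ratings mentioned, and the 0-10 band comes out at 26%.

```python
# Counts copied from the table above, keyed by rating band
counts = {
    "100": 1287, "91-99": 766, "81-90": 2065, "71-80": 2155,
    "61-70": 1707, "51-60": 1486, "41-50": 1742, "31-40": 1361,
    "21-30": 1661, "11-20": 1733, "1-10": 2113, "0": 3479,
}

total = sum(counts.values())
shares = {band: n / total for band, n in counts.items()}

# Combined bands at the two ends of the scale
top_share = shares["100"] + shares["91-99"]     # ratings 91-100
bottom_share = shares["1-10"] + shares["0"]     # ratings 0-10

for band, share in shares.items():
    print(f"{band:>6}  {counts[band]:5d}  {share:.4f}")
print(f"total={total}  top={top_share:.4f}  bottom={bottom_share:.4f}")
```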

In addition to the overall lack of a bell curve, the skew of the
distribution is pronounced, with 26% of all ratings being 0-10.  As
noted in the previous post, this skew toward low ratings had two
potentially undesirable effects.  First, it meant that several judge
assignments, particularly late in the tournament, were lower than might
be expected.

As an important side note here, it is remarkable how much perception
and labeling affect reality.  It is clear that in terms of actual
ordinal equivalents, both Wake and Kentucky exceeded the performance of
previous years (in aggregate).  But the perception didn't always match.
A team who received their #5 ranked judge when that judge was rated 100
was happier than a team who received their #5 ranked judge who was rated
a 75.  As noted before, at Wake a placement of an ordinal 45 seemed to
be inadequate where calling the same match a 3 seemed to be pretty good.
 Since we think in terms of grading scales from our classes in school
where an 85 is a B or a C depending on curve, it's easy to forget that
our 85 on the rating task (where 45 was the mean) might be firmly in the
A range in our distribution.  It's remarkable how these perceptions
work.  I was intrigued to hear of a tournament with three categories
call those categories A, A- and Strike.  I call my 9-categories
A+,A,A-,B+,B,B-,C+,C, STRIKE but prefer to just use the numbers. 
Bruschke's web entry system would call the same categories
A+,A,B+,B,C+,C,D+,D,F.  While the reality is the same, the perception
might be very different.
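
A rough sketch of the two comparisons in this side note.  The first
function shows that the same ordinal position can carry very different
ratings on two sheets (both sheets are invented).  The second expresses
a rating as a standard score using the overall mean (45.51) and stdev
(34.09) reported at the top.  Using a standard score is my illustration
of "where 45 was the mean", and those figures are pooled across all
teams rather than computed per team, so treat it as an approximation.

```python
def rating_at_rank(ratings, rank):
    """Rating of the judge in the given ordinal position (1 = most preferred)."""
    return sorted(ratings, reverse=True)[rank - 1]

def standard_score(rating, mean=45.51, stdev=34.09):
    """Distance of a rating from the overall mean, in standard deviations."""
    return (rating - mean) / stdev

# Hypothetical sheets: one team ties its top judges at 100, the other spreads them
sheet_p = [100, 100, 100, 100, 100, 60, 30, 0]
sheet_q = [95, 90, 85, 80, 75, 60, 30, 0]

fifth_p = rating_at_rank(sheet_p, 5)
fifth_q = rating_at_rank(sheet_q, 5)
score_85 = standard_score(85)
print(fifth_p, fifth_q, round(score_85, 2))
```

Both teams would be getting their #5 judge, but one sees a 100 and the
other a 75.  An 85 sits a bit more than one standard deviation above the
overall mean.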

The second negative effect of the skewed distribution is that judges
who would otherwise be difficult to place became exceptionally difficult
to place AND when they were forced into the pairing, the resulting match
appeared significantly worse (even if it wasn't in ordinal terms).  The
lack of limitations on how many judges could be ranked extraordinarily
low did have an impact on final pref results.  Having said that,
however, I shouldn't understate the fact that the system handled the
incommensurability of different rating distributions remarkably well
EVEN in the face of distributions that radically departed from

BUT, given the fact that ratings will also represent a set of strategic
decisions as much or more than they represent the true "preference"
associated with each judge, any "fairness" metric and even any
statistical outcome metric will always be flawed.  Given the fact that
the ratings don't end up distributing normally, given the fact that they
are significantly skewed in favor of lower ratings, and given the fact
that a lack of externally imposed constraints will continue to produce
uncertainty regarding the limits of creative strategies (e.g. how many
0's can I assign before I have the risk of having one or more of them
assigned), I strongly suspect that the option of having a universal
rating scheme without any constraints can't be sustained.
