[eDebate] Mutual preference experiment - the results

Gary Larson Gary.N.Larson
Wed Nov 22 16:09:00 CST 2006

The results - the discussion thus far has focused on the input data
distributions and the anomalies that they sometimes created.  This
should not obscure the overall success of the pairing at both Kentucky
and Wake.

Because reporting raw ratings is misleading (the distributions of each
team are incommensurable, with means ranging from 24 to 60), aggregate
data reporting for comparisons requires translating the data into
ordinal or categorical equivalents.  In that regard, two caveats are
necessary.  First, the ordinal data reporting counts ties as the highest
rank rather than the mean value of the range represented by the tie.
This follows the convention that when you say a judge is ranked 35th,
there are 34 judges in the pool ranked higher (but no conclusion can be
drawn as to how many are ranked lower).  While this represents our
practice in treating all members of the same category as "equal," it
might overstate the data in the minds of some people who would say that
not all judges tied between 35 and 40 deserve to be treated equally as
35.  While this convention does have an impact on the reporting, I don't
think that it represents an overstatement.
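As an illustration of that convention (the ratings below are hypothetical, not tournament data), every judge in a tie group receives the best rank of the group, which equals one plus the number of strictly higher-rated judges:

```python
def competition_ranks(ratings):
    """1-based ranks where a higher rating is better and tied judges
    all share the best rank of their tie group, so a judge ranked 35th
    has exactly 34 judges rated strictly higher."""
    return [1 + sum(r > x for r in ratings) for x in ratings]

# Hypothetical ratings: three judges tied at 85 all take rank 2.
ratings = [90, 85, 85, 85, 70]
print(competition_ranks(ratings))  # [1, 2, 2, 2, 5]
```

Note that under this convention no judge is ranked 3rd or 4th; the tied judges occupy ranks 2 through 4 but all report as 2, which is the source of the possible overstatement discussed above.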

The translation from ratings into categories is slightly more
problematic in two ways.  As noted in a previous post, the boundaries of
teams' clusterings of judges rarely corresponded exactly to
traditionally defined categories.  As a result, several ties crossed a
category boundary.  For instance, if a team had tied ratings for judges
16-21 at Wake, where each category 1-6 would have had 18 judges, all of
the tied judges would be included in category 1.  In many cases, given
the same situation at a typical tournament, a team would have forced the
tie to be broken so that it included only 18 judges in category 1.  But
for reporting purposes here, the software has no alternative but to
count judges 16-21 all as 1's.  It is actually an open empirical
question whether the team would be better or worse served by arbitrarily
breaking the tie (if it is real) to ensure that they don't go over the
quota of 1's (something they are always permitted to do).  On the flip
side, a number of teams adopt the strategy of assigning excess 1's in
order to increase their odds of getting them.  Assuming that there
wasn't a tie crossing the boundary, the software limits the reporting of
1's to the top 18 judges.  Taken together, the two issues tend to cancel
each other out in the reporting of results, but in all fairness the
first scenario is more frequent than the second.  As a result, the
categorical equivalents are slightly overstated relative to what would
have been the case if the categories had been imposed by the teams
rather than the software.  My estimate is that this overstatement may be
on the order of .10 to .15.
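A minimal sketch of the boundary problem, assuming the reporting software derives a judge's category from the shared (best) rank of the tie group (the function name and quota handling here are mine, not the software's): six judges tied across ordinal positions 16-21 all inherit rank 16, so all six land in category 1 and the quota of 18 is exceeded.

```python
QUOTA = 18  # judges per category at Wake, as described above

def category(shared_rank, quota=QUOTA):
    """Category implied by a judge's (tie-shared) ordinal rank."""
    return (shared_rank - 1) // quota + 1

# Six judges tied at positions 16-21 all share rank 16 ...
print([category(16)] * 6)  # six 1's: 21 judges counted as 1's, not 18
# ... whereas breaking the tie would push positions 19-21 into category 2.
print([category(r) for r in (19, 20, 21)])  # [2, 2, 2]
```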

That said, the results of the experiment and the new algorithm remain
striking (even with the difficulties created by the lack of external

Translated into 1-9 category system

Teams at or above break including presets

      Wake 06     Wake 05      Kent 06     Kent 05

1       1.73          2.42             1.97         2.88
2       1.83          2.42             1.92         2.71
3       1.79          2.13             2.09         2.65
4       1.63          2.20             2.32         2.91        
5       1.57          1.69             1.66         2.16
6       1.61          1.84             1.67         2.35
7       1.63          1.90             1.53         2.20
8       2.21          1.76             1.74         2.11

TTL   1.73          2.09             1.91         2.56

All teams - whether in or out of contention

      Wake 06     Wake 05      Kent 06     Kent 05

1       1.73          2.42             1.97         2.88
2       1.83          2.42             1.92         2.71
3       1.79          2.13             2.09         2.65
4       1.63          2.20             2.32         2.91
5       1.72          1.85             2.16         2.40
6       1.63          1.98             2.24         2.52
7       2.28          2.34             2.34         3.10
8       2.70          2.50             2.80         2.63

TTL   1.91          2.23             2.23         2.73

In both of the above tables, the results are striking in that even with
the caveat that category equivalents to the rating data might overstate
the results (as noted above), Wake 06 proved to be better than Wake 05
in aggregate and for all but round 8.  Of course, round 8 results
concerned a number of folks this year (even though they should be put in
an overall context).  The solution can go in several different
directions.
First, as can be noted from Wake 05, late prefs can be improved by being
less aggressive about early prefs (preserving judge commitments).  This
is a typical strategy that doesn't require the radical step of using
random judges.  A tab room can commit to using all committed judges in
presets AND preserving high-pref partial commitment judges while
maximizing mutual preference WITHIN those constraints.  Because of the
change in the form of data that gets presented to the tab room as a
diagnostic in the new system, it was unfortunately too easy to be too
aggressive early.  Fixing that "problem" comes with experience.

Kentucky 06 also proved to be better than Kentucky 05 even though it
operated with an extremely tight pool (the principal difference between
Kentucky and Wake).

A more profound indication of the ultimate success of using the power
of the algorithm with more discrimination can be found in the data when
it is translated from ratings into ordinal equivalents.  We end up with
the following:
Ordinals are out of 159 judges at Wake and out of 122 judges at
Kentucky.

      Wake 06           Wake 06          Kentucky 06    Kentucky 06
     In contention    All rounds        In contention    All rounds

1       20.20             20.20                  20.42             
2       21.91             21.91                  18.77             
3       21.91             21.91                  21.88             
4       19.06             19.06                  25.77             
5       17.65             20.43                  15.57             
6       17.97             18.29                  14.39             
7       18.96             30.69                  11.58             
8       28.35             38.30                  15.94             

TTL   20.46             23.85                  18.83             

In other words, since 18 judges would have been categorized as 1's at
Wake, the results mean that the AVERAGE judge assigned for all teams in
contention was just barely into the 2 range, and that even for all teams
in the tournament the overall average was better than the middle of the
2 range.  The differences between Wake and Kentucky reflect the fact
that Kentucky operated with a much tighter pool (tournaments generally
improve prefs the larger the overall pool, particularly if there are
more partial-commitment judges; Wake had 5 more teams and 37 more
judges), and the fact that Kentucky had less fall-off in round 8 was a
result of relatively "worse" performance during presets.
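The arithmetic behind "just barely into the 2 range" can be checked with a short sketch (the helper function is mine; the 20.46 and 23.85 averages are the Wake 06 figures from the table above): with 18 judges per category, category 2 spans ordinal positions 19-36, whose midpoint is 27.5.

```python
QUOTA = 18  # judges per category at Wake

def category_of(ordinal, quota=QUOTA):
    """Category containing a given (possibly fractional) ordinal position."""
    return int(ordinal - 1) // quota + 1

mid_of_2 = (19 + 36) / 2      # 27.5, midpoint of category 2
print(category_of(20.46))     # 2: in-contention average, just past rank 18
print(23.85 < mid_of_2)       # True: all-rounds average beats the middle of 2
```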

Addendum to discussion of rating data in previous post:

One of the most profound indicators of the skew in the distribution is
found in the fact that 15 teams had average ratings of less than 30,
and in each case the standard deviation was greater than the average.
What that means is that for such teams, the z-score of a rating of 0
was actually GREATER than -1; that is, a 0 rating fell less than one
standard deviation below the mean.
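A quick numeric check, using hypothetical figures in the range the post describes (a mean under 30 with a larger standard deviation): the z-score of a 0 rating is -mean/sd, which stays above -1 whenever the standard deviation exceeds the mean.

```python
mean, sd = 28.0, 35.0        # hypothetical team: sd greater than the mean
z_of_zero = (0 - mean) / sd  # -0.8
print(z_of_zero > -1)        # True: a 0 rating is within one sd of the mean
```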
