[eDebate] Fwd: Re: Mutual Preference at Kentucky

Gary Larson Gary.N.Larson
Fri Sep 8 09:30:38 CDT 2006


Since several folks may have questions about what the new system
means and how it impacts their decision-making process in assigning
ratings to judges at Kentucky, I will forward e-mail exchanges that I
get from folks (assuming I have their permission).  The following
represents an exchange with Joe Bellon.  To read it, start with the last
message and then work your way up.
 
While it won't eliminate everyone's angst, I want to encourage folks to
simply assign ratings to judges based on how happy/satisfied you would
be to have that judge in the back of the room (with whatever variables
make up that calculus for you).  I think most of the angst will be
created by wondering how other folks will do their ratings and whether,
by figuring out how they do it, you can fill out your ratings in such a
way as to maximize the possibility that you will like the judge in the
back of the room better than they do.  Worse yet, we're anxious about
the possibility that someone, or perhaps everyone else, will have
figured out a magic way to rate judges so that they will like the
judges in the back of the room more than you do.  While it might seem
idealistic, I'm hopeful that we won't lose sleep over attempting to
create either offensive or defensive strategies to game the system. 
Beyond telling you that such strategies are unlikely to work no matter
how much sleep you lose, I'm hoping that by just rating judges based on
our own judgments we can answer some fundamental questions about our
potentially differing experiences with the population of judges.

>>> On Fri, Sep 8, 2006 at  7:52 AM, in message
<4501213B.3F5C.0033.0 at wheaton.edu>, Gary Larson wrote:
Regarding your worries about the ends of the continuum, from a
statistical point of view it makes very little difference once scores
are translated into z-scores.  If one team rates all judges between
70-100, another team from 1-30 and another team from 0-100, as long as
they all are "normally" distributed they would be essentially identical
distributions once they are transformed by z-scores.  The only question
for mutuality is how many stdevs a score is above the sample mean for
that team's distribution.  Even issues like skew and kurtosis are taken
into consideration - though not with the same degree of discrimination.
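
As a quick illustration, here's a rough sketch in Python (illustrative
only - not the actual tab program, and the numbers are made up) of how
three very different raw scales land on the same standardized units:

import statistics

def z_scores(ratings):
    # Standardize one team's ratings against its own mean and stdev.
    mu = statistics.mean(ratings)
    sigma = statistics.stdev(ratings)
    return [round((r - mu) / sigma, 2) for r in ratings]

# Hypothetical ratings of the same five judges on three scales.
team_a = [70, 78, 85, 92, 100]                     # only uses 70-100
team_b = [(r - 70) / 30 * 29 + 1 for r in team_a]  # same shape on 1-30
team_c = [0, 30, 50, 70, 100]                      # full 0-100 range

print(z_scores(team_a))  # [-1.28, -0.6, 0.0, 0.6, 1.28]
print(z_scores(team_b))  # identical: the shift and scale drop out
print(z_scores(team_c))  # different spacing, but commensurable units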
 
The question of whether two teams not only have a different set of
ratings but even a fundamentally different logic for creating ratings
will matter no more or less than in current schemes where it is already
true.
 
GARY

>>> On Thu, Sep 7, 2006 at  9:28 PM, in message
<e06af37d0609071928j782aa18k48bd85b32c16ce6a at mail.gmail.com>, "Dr.
Joe Bellon" <debate.gsu at gmail.com> wrote:
Thanks for the feedback. Feel free to post it publicly.

The one thing I think we are disconnecting on (assuming that's even a
verb) is the question of what 100 represents and how that affects what
we rate things. I guess I am hoping for an alternative to the Likert
scale metaphor. I like the 100 system because it starts me thinking
about things from a percentage standpoint. 

I think you've very accurately pointed out the problem with one of my
proposed metaphors -- basically, I assumed that folks would rank based
on decision-making ability when the reality is that at least some of the
issue is the likelihood a particular judge will vote for us. 

Still, I think it would be helpful to have some sort of consensus (or
at least a general idea) of what the ends of the continuum are. Is 100 a
perfect judge, or does it represent a degree of confidence that a judge
is good for us, or perhaps something else? I don't think I did a good
job of explaining what difference I think this makes in ranking, say, a
40. If we are ranking degree of confidence, then a 40 is less reliable
than a coin flip. In other words, this person will make a bad decision
more often than a good one. On the other hand, if you are ranking
perfection, then a 40 isn't all that bad. If I were 40% of the golfer
Tiger Woods is, then I'd be substantially richer than I am now. 

Anyway, as you point out, this kind of question doesn't matter within
my own rankings, but it does matter (I think, if I understand the system
correctly) if I am ranking based on confidence and Mike Davis is ranking
based on perfection. If there is a disconnect between his understanding
of the system and mine, then what the system would produce is not really
mutual. Obviously, the same could be said of the previous 1-9 ranking
system, but in that world at least we all had the same requirement in
terms of how many judges could be ranked in each category. It seems at
least possible that a much larger problem could arise in the system you
are experimenting with if we have wildly divergent understandings of the
idea behind the rankings. 

I am hoping here for a kind of community coming-together over this
issue, but I am also hoping you will persuade me that I'm wrong. I
really like the idea of your system and want to find reasons it will
deal with my concerns. 

Thanks,
Joe

On 9/7/06, Gary Larson <Gary.N.Larson at wheaton.edu> wrote: 

>>> On Thu, Sep 7, 2006 at  4:15 PM, in message
<e06af37d0609071415n375d47d8w5732443e45802290 at mail.gmail.com >, "Dr.
Joe Bellon" <debate.gsu at gmail.com> wrote:
 
You ask a very apropos question.  I prefer to think of the 100 point
scale in similar terms to your Likert scale metaphor (or stanines).  I
would assign scores between 90-100 to those you would give 1's, 80-90 to
those you would give 2's, 70-80 to those you would give 3's, etc.  The
reason I give 100 values rather than 9 or 10 values is that I believe
that folks may vary in terms of how much discrimination they think that
they can apply and it may well be that different degrees of
discrimination are possible in different parts of the range.  I may be
able to distinguish between my top ten judges but not between my middle
20.  If I can't and want to give them all the same rating, the system
permits it.  Even more critically, it differs from the simple 1-9
categories by not mandating in advance any quota for any of the ranks. 
I'm genuinely curious as to how much difference that artificial
categorization makes. 
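
For instance, here is a rough sketch of that mapping (my own
illustration - the band edges just extend the 90-100, 80-90, 70-80
pattern above):

def band_for_category(category):
    # (low, high) rating band for an old 1-9 category, 1 = most preferred.
    high = 100 - (category - 1) * 10
    return (high - 10, high)

for cat in range(1, 10):
    print(cat, band_for_category(cat))
# 1 -> (90, 100), 2 -> (80, 90), ..., 9 -> (10, 20)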
 
Regarding your second question - are we asking whether you consider a
judge to be best if you are 100% confident that they will make the right
decision win or lose OR 100% confident that they will vote for you? 
This is a very real question and part of the reason that we always have
to have MUTUAL as part of mutual preference.  I strongly suspect that
rating judges primarily on whether they will vote for you results in a
greater number of non-mutual decisions.  If I know that a judge is a
homer, presumably other folks do too, which means they are less inclined
to rank them high.  So whether or not everyone does what I WISH they
would do and rates judges based on confidence that they will make the
right decision, I suspect that practicality dictates in the end that the
most highly rated critics and the most mutual critics are those whose
decision-making we are confident in. 
 
You mention that you would never say that you are less than 50%
confident that they will make the right decision.  I hope that that is
absolutely true.  Still, if 50 is the mean, some judges fall below it. 
Viewing it as a "grade" still permits high scores and low scores.  It is
somewhat arbitrary that we have speaker points where less than 27 is
"awful" or exams in school where less than 88 is below average. 
Particularly since this is a subjective judgment you can set the mean
wherever you want.  I suggested 50, but even if you set the mean at 85,
the z-score normalization will still provide commensurability with
another team that sets it at 40. 
 
Regarding the strike (or lack thereof), unless absolute mutuality
trumps everything, this system will have the same distribution of
judges we typically have.  Mutuality is only one factor in the mutual
preference calculus.  If we have 4 categories with 40% A's, we can
probably dictate that all judges need to be categorically mutual.  When we
have 9 equal-sized categories we permit off-1's at the top and off-2's
at the bottom because it is "mutual enough" and actually more mutual
than the 4-category alternative.  If we have 100 categories, it may well
be that off-10 is mutual enough.  So I would be willing to say that for
teams not eliminated, you can be almost absolutely assured of having
judges in the top 40-50% of your range.  I'm guessing we'll do better
than we've done with 9-category systems so you don't need to be any more
worried about your ratings than you would have been with your 6-9's. 
Once teams are eliminated we are still quite committed to staying in
the top 2/3's unless we're absolutely stuck.
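
To illustrate the "mutual enough" tolerance in the simplest possible
terms (a sketch only - the tournament's actual comparison happens on
z-scores, not raw ratings):

def mutual_enough(rating_a, rating_b, tolerance=10):
    # True if two teams' ratings of a judge are within the tolerance,
    # the 100-point analogue of "off-1" in a 9-category system.
    return abs(rating_a - rating_b) <= tolerance

print(mutual_enough(85, 78))  # True: off-7 on a 100-point scale
print(mutual_enough(85, 60))  # False: off-25 is not mutual enough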
 
Thanks for the questions - can I post the exchange to edebate since
others are likely to have similar questions?  

GARY
Gary, this sounds very interesting. We spent a long time discussing
such a system last year at D6 regionals, and I am thoroughly excited
about anything that prevents me from having to re-rank judges at every
tournament. I have a non-mathematical question, though, that I am hoping
will help me understand the guiding philosophy behind the system. 

What does the percentage represent? I think we (or, rather, I) need a
metaphor to help me figure out how to rate judges.

As an example, I always saw the 9-point ranking system as a kind of
Likert scale -- a 9 is highly preferred, a 1 is strongly non-preferred,
and so on. When I think of percentages, there are a lot of competing
metaphors that come to mind. Is 100% a 100% perfect judge?
Alternatively, I could think of 100% as representing a degree of
confidence in a given judge -- so, for example, I might be 100% sure
that Calum Matheson will make the right decision in a given round. 


These metaphors make a big difference to me. If we are judging
percentage of perfection, then I might think a 40% judge is pretty good
even though they are below the midpoint. On the other hand, if we are
judging confidence in decision-making skills I would never want a judge
under 51%. There are obviously other possible interpretations of
percentage, but if there isn't some kind of guidance on this point then
I worry that we will replicate the differing assumption issue that
plagues ordinal ranking systems. 

Additionally, is there such a thing as a strike in this system? If I
rank someone a 0%, is there any chance they might judge my team?

Anyway, thanks again for all your work and for being an innovator.

-Joe

On 9/7/06, Gary Larson <Gary.N.Larson at wheaton.edu > wrote: 

As noted by JW, we are going to experiment with a new way of
establishing the "mutual" part of "mutual preference" at the Kentucky
tournament.

 
If the only issue was maximizing preference, any system of ranking
judges would be simple to implement.  We could have any number of
categories, ordinal ranking, judge ratings ...  Our goal would be to get
the best judge possible for both teams but we wouldn't worry whether the
outcome was mutual. 
 
The real issue is to measure and establish an appropriate level of
mutuality.  Each system that has been used - circles and strikes, ABCX,
A+AB+BCX, 9 categories 1-9, or ordinal rankings - possesses some
advantages but also some real weaknesses in establishing mutuality. 
 
If we have a small number of categories with a large number of judges
in each, it is relatively easy to have nearly all rounds be exact
matches, but of course what it means to be an "exact" match becomes less
meaningful since the "within-category" difference can be quite large. 
As we increase the number of categories and decrease the number of
judges within each category, it becomes more difficult to have "exact"
matches and also more difficult to pair rounds with the highest
available category.  But it is arguable that the final outcome is still
superior in terms of absolute preference and absolute mutuality, and the
community continues to have discussions about the relative merits of
using 4, 6, or 9 categories.  Regardless of the number of categories,
any "category"-based system suffers from the incongruity between
in-category and between-category differences and from the fact that the
tournament mandate to have a minimum quota of judges in each of the
categories fails to reflect genuine differences between teams regarding
how many judges they might genuinely define as A's or B's or whatever. 
 
An alternative that was tested briefly about five years ago was to use
ordinal rankings rather than categories where it can be "proven" that
you can better maximize and measure both mutuality and preference in
continuous rather than categorical terms.  While the experiment did
result in "improved" outcomes, most teams found the ranking challenge to
be more work than they desired.  Ordinal rankings also suffer from the
"myth" that the qualitative difference between any two ranks are
proportional to the numeric difference between those ranks.  In other
words, is the difference between my 5th and my 10th judge identical to
the difference between my 25th and my 30th or between my 50th and my
55th?  This continuous linear assumption violates the intuition that
judge ratings cluster together (though not in the artificial groups
created by categorical systems) and that they might be arrayed in a bell
curve. 
 
As a result, the experiment is twofold.  We are going to permit teams
to assign ratings 0-100 to all judges in the pool with NO arbitrary
limitations on what values get assigned (though I will recommend that
folks aim for a mean of 50).  Taken by itself, this rating task will
help address the issue of commensurability.  Are there real differences
between the ways that various teams evaluate the available judging pool?
All of our systems to date force a version of commensurability in order
to manage mutuality. 
 
The second is more ambitious.  To the extent that we discover that
there ARE significant incommensurabilities, what is the best strategy
for imposing mutuality?  The data that is collected can be translated
into 4-category, 6-category, 9-category, or ordinal equivalents so that
we can see how each system would function.  But
for the purpose of the tournament, we will test another alternative. 
The statistical procedure that we use for speaker point tie breakers,
z-scores, can be used to transform all of the various distributions into
more commensurable "normalized" distributions that can be used for
mutuality.  So the mutuality judgment for each judge assignment will be
based on the z-score, the number of standard deviations above or below
the sample means, for each of the two teams.  But at the same time, I
will be able to report the resulting mutuality based on how it would
have been computed in each of the competing categorical or ordinal
systems.  So while the technology might make the outcome seem more
opaque, we will be very open in post-tournament reporting of results.
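
Sketched in code (again my own illustration, with invented sheets and
function names - not the tab program itself), the mutuality judgment
for one judge looks roughly like this:

import statistics

def z_for(rating, team_ratings):
    # z-score of one rating within a team's full distribution.
    mu = statistics.mean(team_ratings)
    sigma = statistics.stdev(team_ratings)
    return (rating - mu) / sigma

def mutuality_gap(judge, sheet_aff, sheet_neg):
    # How far apart the two teams' standardized ratings of a judge sit.
    return abs(z_for(sheet_aff[judge], list(sheet_aff.values()))
               - z_for(sheet_neg[judge], list(sheet_neg.values())))

# Hypothetical preference sheets (judge -> 0-100 rating).
aff = {"Judge A": 92, "Judge B": 55, "Judge C": 30}
neg = {"Judge A": 88, "Judge B": 70, "Judge C": 20}
for judge in aff:
    print(judge, round(mutuality_gap(judge, aff, neg), 2))
# The pairing still maximizes preference; this gap measures only the
# mutuality side of the calculus.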
 
In attempting this experiment, I do have one strong recommendation.  In
all of the mutual preference schemes used to date, some teams conclude
that there must be some way to "game" the system to obtain better
preferences than one might expect or worse preferences for one's
opponent than they might expect.  So teams rank judges they don't want
as A's, thinking that no one else will prefer them at all, thereby
allowing them to concentrate their A's or use mutuality as a means of
increasing their strikes.  Someone will be tempted to conclude that
there must be some way to create ratings for judges that will accomplish
the same kind of outcome.  While I seriously doubt that you will
succeed, my real request is that you don't even try.  The initial and
perhaps most important result of the experiment is NOT the use of
z-scores for mutuality but rather getting an absolutely honest
distribution of ratings for each team to test our assumptions about
commensurability.  If your ratings don't reflect your genuine evaluation
of each of the judges, that foundational objective won't be met.
 
As a side note, IF this experiment is successful, the collection of
judge ratings is something that can be tournament-independent, since
teams wouldn't be required to re-rate judges at each tournament to meet
the quotas that individual tournaments impose.  In fact, even if
tournaments chose to
use categorical or ordinal systems, the data "could" be directly mapped
from the ratings that would be in the database (editable whenever teams
chose to revise their data). 
 
Thanks

_______________________________________________
eDebate mailing list
eDebate at ndtceda.com 
http://www.ndtceda.com/mailman/listinfo/edebate 





