[eDebate] Making the experiment less 'risky'
Morris, Eric R
Tue Nov 6 11:15:07 CST 2007
If we end up with a 100 point scale with guidelines, then I will try to
follow that scale and hope others will too.
The lowest risk version of this is probably the early suggestion for
decimal places (28.4, etc.).
Of course, the "20" is superfluous, since basically all speaker points
start with that number. "8.4" is just as meaningful, and would save some
data entry time.
"8.4" is pretty similar to "84". Just move the decimal point.
Thus, Wake could adopt a 100 point scale where you start with the
formula (X - 20) x 10, with X being your score on the 30 point scale,
and then make quality gradations from there.
Thus, if Wake decides to put me in charge of writing the "scale", here
is my proposed text:
"Begin with your preferred points on a 30 point scale. Subtract 20.
Multiply by ten. Make minor adjustments, using whole numbers as you see
fit. For example, a debater might give a performance you would have
called 28.5. Subtract 20, leaving 8.5. Multiply by 10, so the score is
85. Feel free to move the number up or down up to 2 points to make a
finer quality distinction."
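The arithmetic in the proposed text can be sketched in a few lines of Python. This is only an illustration, not anything Wake has adopted: the function name to_100_point and its adjustment parameter are made up for the example, and it assumes the 30 point score is entered in halves or tenths so the converted score lands on a whole number.

```python
def to_100_point(points_30, adjustment=0):
    """Convert a 30 point score to the proposed 100 point scale.

    points_30  -- the score the judge would have given on the 30 point scale
    adjustment -- hypothetical whole-number tweak, limited to -2..+2 as in
                  the proposed text
    """
    if not isinstance(adjustment, int) or not -2 <= adjustment <= 2:
        raise ValueError("adjustment must be a whole number from -2 to 2")
    # Subtract 20, multiply by ten; round() cleans up floating point noise
    # such as (28.4 - 20) * 10 coming out as 83.99999999999999.
    base = round((points_30 - 20) * 10)
    return base + adjustment


if __name__ == "__main__":
    print(to_100_point(28.5))      # the 28.5 example from the proposed text
    print(to_100_point(28.5, -2))  # same performance, nudged down two points
```

The limit of 2 on the adjustment comes straight from the proposed text; everything else about the interface is an assumption.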
The benefit of allowing quality gradations is achieved with less risk
that individual judges will privilege or harm debaters by using
different mid-point assumptions. That feels like the best of both worlds
- shaking things up more might be interesting but is probably NOT the
best of both worlds.
A few possible addenda:
1. Wake could require that no two debaters in a round share the same number.
possible addition. Thus, a block 28 round might end up with 82, 81, 80,
and 79 - probably in rank order.
2. Any points under 60 could require verification of intention, as
with low point wins.
3. The scale could cap at 99 instead of 100, in case the 3rd whole
digit creates programming hassles. I doubt it would, since we have 3
significant digits now (though we only "use" two of them), but I'd defer
to Gary on that question.
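If the tab software were to enforce the three addenda above, the checks might look roughly like this. It is a sketch under assumptions: check_round, cap and verify_below are hypothetical names, and nothing here reflects how the actual tabulation program works.

```python
def check_round(scores, cap=99, verify_below=60):
    """Return a list of warnings for the speaker scores in one round.

    scores       -- the 100 point scale scores given in a single debate
    cap          -- highest score allowed (99 per addendum 3)
    verify_below -- scores under this need the judge to confirm (addendum 2)
    """
    warnings = []
    # Addendum 1: no two debaters in the round share the same number.
    if len(set(scores)) != len(scores):
        warnings.append("duplicate scores: each debater needs a distinct number")
    for s in scores:
        # Addendum 3: keep scores to two digits.
        if s > cap:
            warnings.append(f"{s} exceeds the cap of {cap}")
        # Addendum 2: very low points need verification, as with low point wins.
        if s < verify_below:
            warnings.append(f"{s} is under {verify_below}: confirm this is intended")
    return warnings


if __name__ == "__main__":
    print(check_round([82, 81, 80, 79]))  # the block 28 round from addendum 1
    print(check_round([82, 82, 80, 59]))  # a tie plus one very low score
```

Returning warnings rather than rejecting the ballot outright mirrors the "verification of intention" idea: the judge gets asked, not overruled.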
From: edebate-bounces at www.ndtceda.com
[mailto:edebate-bounces at www.ndtceda.com] On Behalf Of Gary Larson
Sent: Tuesday, November 06, 2007 10:36 AM
To: edebate at ndtceda.com
Subject: [eDebate] A less risky experiment
While I'm always intrigued by the opportunity to research something
(rather than actually having to do it), we would need to clearly
understand the limitations of the "controlled" study that Stefan and
David are proposing. I'm not opposed to collecting two different scores
for each debate. I'd even be open to arguments as to which of the
scores counts towards the actual results and which is simply correlated
(though all of this is really WFU's decision to make rather than mine).
But the proposed experiment would prove to be the classic case where
studying a behavior potentially changes the behavior in question. When
we start the exercise by saying, "assign the points that you would
assign on a 30-point scale and then indicate the points you would give
on a 100 point (or 50 point) scale" we have no idea how the process of
identifying two different scores AND consciously correlating them
impacts one or both of the scores assigned. My first hypothesis would
be that the scores on the 30 point scale would NOT have, in fact, been
the scores that would have been assigned had the scale research not been
in progress. Imagine the judge that now gives 27.5's to 80% of the
debaters that they judge in a tournament. I can imagine that for some
such judges, they "might" argue that it was just a matter of
discrimination - none of those debaters were quite good enough to merit
a 28 and none were poor enough to get a 27. But for others, the 27.5 is
just a polite fiction or convenience. They did enough work deciding who
won the debate. I suspect that the experiment would result in some of
those judges giving a broader range of scores on the 30-point scale.
If that happened, wouldn't it just prove that reform was unnecessary -
that judges could reconceptualize the current scale and produce a more
discriminating set of scores? Perhaps. Though without the force of the
experiment, we'd be back in the SQ where friendly persuasion hasn't
done much over the years. But some others might actually give a narrower
range of scores. If they start the exercise by using the 100-point
scale to provide discrimination they might discover that all of those
scores can translate back to the same 30-point equivalent. Once again,
writing two scores down on the ballot with explicit instructions that
they should correlate in accordance with a predefined conversion scale
might influence the scores assigned - including the 30-point versions.
The other question I have is how we would evaluate the outcome of the
experiment. David provides one metric. The new scale succeeds if we
get a normal distribution of +/- 2 points surrounding each of the scores
that represent the 30-point scaled score times 3.3. Of course, that's
assuming that the current scores are correct and that the only issue is
increased discrimination (inflation being irrelevant). To be honest,
one of the dilemmas, whether we change the scale or just do the research,
is that we don't really have any control. There is no "right" score,
correct seed orders, or correct speaker awards to which our outcome can
be compared. If the two scales produce slightly different outcomes, we
have grounds for arguments but not really for conclusions. And if the
two-score research creates an extremely high correlation, that in itself
doesn't prove that the reform is unnecessary, since the research itself
might have prodded the outcome.
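For anyone who wants to see what David's metric would actually compute on a stack of ballots, here is a rough Python sketch. The function names and the sample pairs are invented for illustration (they are not tournament data), and the 3.3 factor and +/- 2 tolerance are taken from the description above. As noted, a high share within tolerance would not by itself settle anything, since writing both scores on the same ballot may shape both.

```python
def deviations(pairs, factor=3.3):
    """For (score_30, score_100) pairs, return how far each 100 point
    score sits from the 30 point score times the conversion factor."""
    return [s100 - s30 * factor for s30, s100 in pairs]


def share_within(pairs, tolerance=2.0, factor=3.3):
    """Fraction of ballots whose 100 point score falls within
    +/- tolerance of the converted 30 point score."""
    devs = deviations(pairs, factor)
    return sum(abs(d) <= tolerance for d in devs) / len(devs)


if __name__ == "__main__":
    # Made-up ballots: (30 point score, 100 point score)
    sample = [(27.5, 91), (28.0, 92), (28.5, 99)]
    print(deviations(sample))
    print(share_within(sample))
```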