[eDebate] 50 point scale double check
Thu Nov 1 21:46:33 CDT 2007
As one of the tab software developers who will have to implement whatever solution we ultimately adopt, let me make a couple of comments about the limitations of the technology, in both a practical and a philosophical sense.
From a practical point of view, I strongly prefer solutions that increase the range of points over those that try to increase their precision. One important practical concern at a large tournament is that the tab staff must enter points quickly so that tabbing doesn't delay future pairings. At present, there are shortcuts that minimize the number of keystrokes required to enter points from 20-30 or 40-50 in half-point increments. Those shortcuts go away if we have to accept all decimals (or even the fractional points proposed by USC). I genuinely believe that recalibrating our points to include all the full-point (or half-point) options between any two tens digits would strike an appropriate balance between precision and reliability. In fact, a 10-option scale in which all options are actually used in a roughly normal distribution would probably get us close to where we need to go; 20 options might be overkill, costing us inter-rater reliability by assuming finer discrimination than judges can actually deliver.
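To make the data-entry tradeoff concrete, here is a minimal sketch that just counts how many distinct values an operator must be able to enter under each proposal. The ranges and increments come from the paragraph above; the exact boundaries of the recalibrated 10-option scale are an illustrative assumption, not a concrete proposal.

```python
# Count the distinct point values a tab operator must distinguish under
# each proposed scale. Ranges follow the discussion above; the 21-30
# boundary for the 10-option scale is an illustrative assumption.

def count_options(low, high, step):
    """Number of values in [low, high] at the given increment."""
    n = 0
    x = low
    while x <= high + 1e-9:  # small tolerance for float accumulation
        n += 1
        x += step
    return n

half_points = count_options(20.0, 30.0, 0.5)  # current shortcut range
tenths = count_options(20.0, 30.0, 0.1)       # full-decimal proposal
ten_option = count_options(21, 30, 1)         # one value per full point

print(half_points, tenths, ten_option)  # 21 vs. 101 vs. 10 options
```

The point of the comparison: moving from half points to decimals multiplies the entry space roughly fivefold, while the recalibrated full-point scale shrinks it, which is what makes fast keyboard shortcuts feasible.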
I'm actually more concerned about introducing potentially opaque statistical solutions. Even z-scores or variance corrections can turn results into a black box: the outcome becomes uninspectable (except with great effort) and inspires little intuitive faith. Expanding that to include season-long variance increases the data-management burden on the tab software and makes the outcome even less verifiable. Additionally, the whole premise of a variance correction is that the thing observed is held constant while we measure and correct variability among the observers. Rather than improving reliability by increasing the sample size, season-long data decreases reliability by radically decreasing the comparability of the samples of debates each observer has seen. While we can imagine that over time all judges would judge all debaters, nothing even remotely like that happens. As a result, season-long sampling would build in the false assumption that the sample of debates a judge sees at any given tournament resembles the sample they have judged the rest of the year. A better way to increase sampling discrimination would be to have judges rank all the debaters they've judged at a tournament from 1-x rather than just 1-4 in each round, but that probably wouldn't work given the widely variable number of rounds each individual judges.
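For readers unfamiliar with the proposal being criticized, a per-judge z-score adjustment looks roughly like the sketch below. The judge names and scores are invented for illustration, and this is the general technique, not any tab program's actual code. It shows both what the adjustment does and the assumption it rests on.

```python
from statistics import mean, stdev

# Hypothetical raw points from two judges. A z-score re-expresses each
# score as "standard deviations above that judge's own average," which
# only corrects for judge stinginess if the pools of debaters the two
# judges saw are comparable -- the assumption argued against above.
scores = {
    "Judge A": [27.0, 27.5, 28.0, 28.5],  # generous scale
    "Judge B": [25.0, 25.5, 26.0, 26.5],  # stingy scale, same spread
}

def z_scores(values):
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

for judge, raw in scores.items():
    print(judge, [round(z, 2) for z in z_scores(raw)])
# Both judges produce identical z-score lists, so the 2-point gap
# between their scales vanishes. If Judge B simply saw weaker debaters,
# that same correction erases a real difference.
```

This also illustrates the black-box worry: once points are replaced by standardized residues of a judge's season, a reader can no longer check a posted result against the ballots without redoing the statistics.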
In the same way, creating complex formulas that fold in more factors to "correct" any individually assigned score makes the outcome a "magical" and rather unsatisfying surprise that just pops out of the machine (much like the computer contribution to the BCS rankings).
Finally, one comment on how badly broken the current system is. On the negative side, we suffer from a lack of discrimination when many (if not most) judges give 90% of their scores within a range of three options. Increasing that to even as many as 10 realistic options would solve most of the problem if it were applied consistently. But, as I've said before, points remain a remarkably good indicator at the team level (if not at the individual level). Once again at the NDT, speaker points proved to be an even better predictor than win-loss record of who would win any individual debate.