This group from science and academia believes there’s a better way for dressage riders to earn their scores.
Michael Kierkegaard, in his essay “We Should Demand More In Our Dressage Performance Standards,” once again raised the issue of performance standards for lower level riders in the United States.
The results from the Olympic Games competition in Hong Kong last summer also created lots of discussion in the international community and led to the creation of the Fédération Equestre Internationale Dressage Task Force, which is currently reviewing several issues that arose from that venue.
All of these discussions have some common points:
• Variability between highest and lowest scores given by judges for a single ride.
• Effect of judge’s position (C, E, B or quarter lines) on the score.
• Proposal for using half-point increments.
They also make some common mistakes as they work through their varying points of view:
• A belief that performance standards at the national level will automatically increase U.S. performance results on the international stage.
• A belief that general performance at the national level correlates to performance on the international level.
• A belief that many riders willfully ride above their level of expertise.
What they fail to discuss is the nature of the underlying problems with the way dressage is currently judged and scored. Until these weaknesses are addressed, the results of any class must be suspect, and no system of performance standards can be trusted to produce the desired results.
Accuracy Vs. Precision
To understand the issues we see in the current judging format, we first need to clarify the concepts of "accuracy" and "precision." To properly discuss the use of half-points in dressage, we also need to understand the definition of "significant figures."
A clear understanding of these terms, and of their importance, is needed to engage properly in any discussion of improving dressage judging.
• By definition, accuracy is how close a measurement comes to the "truth." In archery, how close the arrow lands to the bull's eye.
• By definition, precision is the consistency of repeated measurements, i.e., the measurement error. It is typically thought of as reliability or reproducibility.
Measurements can be accurate but not precise. They can be precise but not accurate. Measurements can be consistently correct, or consistently wrong. A measurement system (dressage scoring is a measurement system) is valid only when it is both accurate and precise.
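The distinction can be made concrete with a small sketch. The scores below are made up for illustration: one hypothetical judge is accurate on average but inconsistent, the other is highly consistent but systematically low.

```python
import statistics

# Hypothetical repeated scores for the same movement; assume its
# "true" quality is 7.0. (All numbers here are invented examples.)
true_score = 7.0
judge_a = [6.0, 8.0, 7.5, 6.5, 7.0]   # accurate on average, not precise
judge_b = [5.5, 5.5, 5.6, 5.4, 5.5]   # precise, but not accurate

def bias(scores, truth):
    """Accuracy: how far the average score sits from the truth."""
    return statistics.mean(scores) - truth

def spread(scores):
    """Precision: how consistent repeated scores are (sample std dev)."""
    return statistics.stdev(scores)

print(f"Judge A: bias={bias(judge_a, true_score):+.2f}, spread={spread(judge_a):.2f}")
print(f"Judge B: bias={bias(judge_b, true_score):+.2f}, spread={spread(judge_b):.2f}")
```

Judge A shows essentially zero bias but a large spread; Judge B shows almost no spread but a bias of a point and a half. Neither measurement system is valid on its own.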
Significant Figures: The digits of a measurement that actually carry meaning. Digits introduced by calculations carried out to greater accuracy than that of the original data, or reported to a greater precision than the equipment supports, are not significant.
The "resolution" of a measurement system is usually one-half of the smallest division on the instrument's scale. Thus, if a ruler is marked in one-millimeter increments, we can visually interpolate to half a millimeter and no finer.
In dressage, we score in whole numbers; therefore, the final scores should be reported in no more than half points. The fact that the calculator can compute scores to three places after the decimal point does not mean those additional digits are useful, accurate or relevant.
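A minimal sketch of that argument, using invented point totals: the calculator produces digits past the decimal, but a whole-number scale only justifies reporting to the nearest half point.

```python
def report_score(points_earned, points_possible):
    """Round a final percentage to the nearest half point -- assuming,
    per the resolution rule above, that half-point reporting is the
    finest a whole-number 0-10 scale supports."""
    raw = 100 * points_earned / points_possible
    return round(raw * 2) / 2

# Hypothetical totals: the calculator yields 67.941... percent,
# but the instrument only justifies reporting 68.0.
print(report_score(231, 340))
```

The extra decimal places are artifacts of the arithmetic, not information about the ride.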
In dressage, the “measurement system” is composed of two parts:
• The "instrument," i.e., the dressage test itself (think of this like a yardstick), and
• The skill of the "operator," i.e., the judge using the yardstick.
How these two parts work together determines the accuracy, precision and significance of the final score awarded to a horse/rider pair.
What Does This Mean?
The concept of accuracy is the key concept at the center of debate in dressage judging, yet it has been given little attention.
Accurate dressage judging requires universal agreement on the standard to which the operator (the judge) is judging. For dressage, this means there must be universal agreement on the correct bearing of the horse. The concept of accuracy, i.e., the standard for what judges are looking for, is front and center in the debate over whether judges are judging to the rules, as well as over variability in judging.
Since no such frame of reference exists in subjectively judged sports, one has to look to industry for examples.
The analog in industry is the international standards on time and distance. The industrialized world subscribes to international agreements supporting the “International System of Units.” These standards are maintained in France by the Bureau International des Poids et Mesures. These “standard measures” assure that one meter measured in Paris is the same as one meter measured in Washington, D.C.
The definition of one meter evolved from being a fraction of the distance between the equator and the North Pole (in the 1700s), to a platinum alloy bar kept at the melting point of ice (until the 1960s). Today a meter is defined by the speed of light in a vacuum.
In all cases, there was ongoing refinement in how precisely the distance was measured, but everyone agreed to abide by the same definition of the standard in use at the time (accuracy).
These changes arose as the technology allowed and as industrial needs demanded finer and finer slices of measurement. This concept is related to the idea of "half-points."
Precision is the measurement error.
Part of that error is introduced by the use of the instrument, the yardstick or the dressage test. Part of the error also is introduced by the operator or judge. Precision is influenced by the operator's training and his or her use of the yardstick or test. It also requires that the judge be clear about the standard he or she is judging against.
Regarding precision and accuracy, we would offer that in dressage there's a high diversity of opinion on how the rules should be interpreted and how the 10-point scale defined in the rulebook should be applied. There appears to be no universal agreement on the standard or the scale. Therefore, it's no surprise that there's variability in judging amongst the judges. And it's even less surprising that most judges tend to stick to the "safe" 5-6-7 range of scores.
Furthermore, we fool ourselves by thinking that we've achieved a fine-grained sorting of a class when the final scores are reported to hundredths of a percentage point. A 0-10 scale cannot produce that level of precision.
There's no meaningful difference between the scores of 72.12 percent and 71.13 percent. The proposal to use half-points to add significant figures should not even be considered until a common standard can be established and its consistent use verified.
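This claim can be tested with a standard two-sample comparison. The sketch below uses invented panel scores for two rides whose averages differ by about one point; with judge-to-judge spreads of a couple of points, the t statistic falls well below the conventional significance threshold.

```python
import math
import statistics

# Hypothetical panel scores (percent) for two rides. The reported
# averages differ by roughly one point, but the judges' spread is
# larger than that difference. (Invented data for illustration.)
ride_a = [74.2, 70.1, 73.5, 69.8, 72.9]
ride_b = [72.8, 68.9, 71.5, 70.4, 72.0]

def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch's test)."""
    mx, my = statistics.mean(x), statistics.mean(y)
    vx, vy = statistics.variance(x), statistics.variance(y)
    return (mx - my) / math.sqrt(vx / len(x) + vy / len(y))

t = welch_t(ride_a, ride_b)
print(f"t = {t:.2f}")  # well below ~2, i.e. not a significant difference
```

The one-point gap between the reported averages disappears inside the judging noise, which is the sense in which the two scores are indistinguishable.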
Why adopt a scale with finer and finer cuts for a measurement when the standard against which that measurement is judged isn't clear?
Our Proposal
If a quality control professional were engaged to look at dressage judging, he or she would assess the current state before embarking on any repairs. This would be done as follows:
• Looking at scores and videos of past rides, one would use standard statistical tests (ANOVA, 1 and 2-sample t-tests) enabled by conventional statistical software (Minitab, SAS-JMP) to analyze variability in judging.
• To date, that has not been done in depth, though the analysis of 2008 scores by the Nerd Herd did a preliminary look at this, which indicated problems exist.
• During a judges’ training session, one would conduct a “measurement systems analysis” to robustly quantify the degree of agreement between the judges (operators).
• There are standard statistical tests and methods by which this is done every day to certify subjective evaluations (for example, interpreting medical tests such as ultrasound, MRI and CT scan results). To date, that has not been done.
• Finally, depending on the two results of above, then determine and implement the “fix.” Improvement could be as simple as enhanced judges’ training.
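The first step above can be sketched in a few lines. The one-way ANOVA below uses invented scores (three judges, four rides) and asks whether ride-to-ride differences stand out above judge-to-judge noise; real data from past shows would replace these numbers.

```python
import statistics

# Hypothetical panel: three judges each score the same four rides
# (percentages). Invented numbers, for illustration only.
rides = {
    "ride_1": [68.5, 66.0, 70.0],
    "ride_2": [62.0, 60.5, 64.5],
    "ride_3": [71.5, 69.0, 72.5],
    "ride_4": [65.0, 63.5, 67.0],
}

def one_way_anova_f(groups):
    """F statistic: between-group variance over within-group variance.
    Large F means the rides are distinguishable above judging noise."""
    all_scores = [s for g in groups for s in g]
    grand_mean = statistics.mean(all_scores)
    ss_between = sum(len(g) * (statistics.mean(g) - grand_mean) ** 2
                     for g in groups)
    ss_within = sum((s - statistics.mean(g)) ** 2
                    for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(all_scores) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

f = one_way_anova_f(list(rides.values()))
print(f"F = {f:.1f}")
```

In practice one would use packaged routines (as the proposal notes, Minitab or SAS-JMP) rather than hand-rolled formulas, but the question being asked is exactly this one.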
Given the current controversies, the first need is for the dressage world to revisit its rules and how its leaders choose to apply and interpret those rules in competition, i.e., to revisit how we are judging to the rules.
Dressage is scored on a 0-10 scale, with final scores reported as percentages, more as an accident of history than due to any well-researched analytical process.
Mathematical research has progressed enormously since the 1912 Olympic Games and can now be used to assess scoring/voting systems for accuracy, precision and applicability to the problem at hand. It can also be used to assess issues such as inter-rater reliability (how well individual judges conform to the established standards for judging).
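One standard inter-rater reliability measure is Cohen's kappa, which scores agreement between two raters beyond what chance alone would produce. The sketch below applies it to invented whole-number marks from two hypothetical judges for the same twelve movements.

```python
from collections import Counter

# Hypothetical whole-number marks from two judges for the same
# twelve movements. (Invented data for illustration.)
judge_1 = [6, 7, 7, 5, 6, 8, 7, 6, 5, 7, 6, 7]
judge_2 = [6, 7, 6, 5, 6, 7, 7, 6, 6, 7, 6, 7]

def cohens_kappa(a, b):
    """Agreement beyond chance: 1.0 = perfect, 0 = chance-level."""
    n = len(a)
    p_observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_expected = sum(count_a[m] * count_b[m]
                     for m in set(a) | set(b)) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

print(f"kappa = {cohens_kappa(judge_1, judge_2):.2f}")
```

A kappa in the 0.6 range is usually read as "substantial but imperfect" agreement; computed over real judges and real classes, this kind of statistic would quantify how well the panel conforms to a common standard.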
It’s time to use the tools developed in industry and academia to improve how riders are evaluated when they present themselves on the centerline.
The COTH Nerd Herd was formed on the Chronicle Forums (www.chronicleforums.com) in response to a request for peer review of the “Dressage Score Analysis” paper written by Yount, Diaz and Johnston in October 2008. Each member—Ana Diaz, PE, Lita Dove, S.M.L. Gray, Jacqueline Greener, Amanda M. Jay, Mary Stydnicki Johnston, Jennifer Lucitti, Ph.D., Katy Moran, Ph.D., Wendi Neckameyer, Ph.D., Caryn Vesperman and Lori Vogt, Ph.D.—has extensive professional experience in using statistics, measurements and analysis tools to improve and validate work results.