Thoughts on the 'Reliability' of Dressage Scores
ShotenStar
Apr. 9, 2008, 02:05 PM
In the continuing effort to engage hearts and minds on the topic of performance standards, I offer the following for you to think about:
There is a large body of literature and professional work in the field of 'measurement' .... measurement of product reliability and performance; measurement of characteristics over time; measurement of change (over time and between units); measurement according to standards ... you name it, someone has figured out a way to not only measure it, but also to validate those measurements and measuring tools. One of those validation techniques is called 'Gage Reliability and Reproducibility Analysis' .... it answers the question: how good is your measurement system?
The literature defines this type of Gage Reliability and Reproducibility Analysis as having three aspects:
One may think of each measurement as consisting of the following components:
1. a component due to the characteristics of the part or item being measured,
2. a component due to the reliability of the gage, and
3. a component due to the characteristics of the operator (user) of the gage.
In the case of dressage tests, the 'components' are:
1. a component due to the characteristics of the part or item being measured -- the performance of the horse / rider being assessed in the ring at that moment. In the case of dressage, there are actually three measurements taken during a test: the scores for the movements; the collective marks for the horse; and the collective mark for the rider.
2. a component due to the reliability of the gage -- the scores / scoring system used in dressage tests -- the numbers 0-10 and the meanings assigned to them.
3. a component due to the characteristics of the operator (user) of the gage -- the judge and the guidance given to judges through the rules and definitions of movements / criteria for scoring.
All three of these points need to be considered when evaluating a measurement system. Right now, the discussion is focused mainly on # 1 - the performance of the horse and rider and the need to impose limits or restrictions on who may show at what levels (the Performance Standards Rule change issue). There is a certain amount of 'blame' floating around in the ether, suggesting that riders are so bad that they need to be regulated in order to prevent 'abuse' to the horse.
There has also been some focus on # 3, the judges as users of the gage. Rebecca's show report sparked some discussion on the question of how well judges are scoring rides when poor rides end up with scores in the 60's, and the USDF's own data shows that out of 107,232 rides in the 2007 show year:
.001 % (1 ride) scored in the 20-29.99 % range
.026% (28 rides) scored in the 30-39.99 % range
1.926% (2,065 rides) scored in the 40-49.99% range
33.522% (35,946 rides) scored in the 50-59.99% range
58.377% (62,599 rides) scored in the 60-69.99% range
6.028% (6,464 rides) scored in the 70-79.99% range
.118% (126 rides) scored in the 80-89.99% range
.003% (3 rides) scored in the 90-100% range
The small number of very low and very high scores, with strong clustering in the mid-range, suggests that judges are not using the full range of scores, and therefore are not conveying to riders the information they need on the state of their riding and training. This has been extensively discussed in the related threads.
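The percentages above can be rechecked from the ride counts. A minimal Python sketch, using only the counts quoted in this post (the variable and function names are illustrative and do not come from the USDF report):

```python
# Ride counts per score band, as quoted above from the 2007 USDF data.
# Keys are the lower bound of each 10-point band (20 means 20-29.99%).
ride_counts = {
    20: 1,
    30: 28,
    40: 2065,
    50: 35946,
    60: 62599,
    70: 6464,
    80: 126,
    90: 3,
}

total_rides = sum(ride_counts.values())


def band_share(counts):
    """Return each band's share of all rides, in percent."""
    total = sum(counts.values())
    return {band: 100.0 * n / total for band, n in counts.items()}


shares = band_share(ride_counts)
for band, pct in shares.items():
    print(f"{band}-{band + 9}.99%: {ride_counts[band]:>6} rides, {pct:.3f}%")

# Share of rides that land in the two middle bands (50-69.99%).
middle_share = shares[50] + shares[60]
print(f"total rides: {total_rides}, 50-69.99% share: {middle_share:.1f}%")
```

The band counts add up to the 107,232 rides quoted, and roughly 92% of all rides fall between 50% and 69.99%.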
There is much less attention being given to # 2, the reliability of the gage.
Dressage tests have been scored on the same 0-10 scale since when? The Beginning of Time? Force of habit does not mean that is the best way to do it. And as we have seen in the pages and pages of discussion on the issue, even the word descriptors can be interpreted in different ways: what is 'sufficient'? Sufficient for the level currently being shown? Sufficient for advancement to the next level? Sufficient not to be injurious to the horse? Sufficient not to be a total embarrassment? (Yes, I know, it is defined as sufficient to meet the guidelines of the level ... just using a little literary license here .... )
The thread I started earlier about changing the collective marks, to put more weight on the rider scores by giving the judges more boxes with which to assess the rider, is one possibility.
Changing the scoring scale itself is another possibility -- why not use half marks (as is done in Freestyles)?
Why not use a more expansive set of numbers?
Why not commission a study by experts in the measurement field to evaluate the scoring system and design a new one (or validate that the current one is 'sufficient' for the job)?
*star* - just thinking on these things .....
Ambrey
Apr. 9, 2008, 04:11 PM
I disagree that the clustering of mid-range scores says that judges aren't using the full range. That clustering is pretty normal for any scored test, athletic or otherwise. What it does tell you is that, since the clustering is at the 60-69 range, that's the mode- but in this case, the curve is skewed, with many more people scoring below the mode than above it. So the "average" score would be below the most common score.
I'm speaking from a psychometric sense, and I've no idea what the theories surrounding judging and scoring of athletic events are, but for a psychometric test this would be undesirable. There should be more spread across the middle range, and less cluster at the upper end- I think it means there isn't enough discrimination between tests in that modal range- that the difference between a 65 and a 69 is a lot less than the difference between a 69 and a 73.
Reliability has to do with the ability of the measure to repeatedly get the same result. Validity has to do with whether the measurement is actually measuring what we want it to measure. At least in my field ;)
It looks kind of like grade inflation- where a C is supposed to be average, but instead most people get Bs, a few As, the rest Cs and Ds and the average is below the mode.
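One rough way to check the "average below the mode" observation against the ride counts from the first post is to estimate the mean from band midpoints. This is only a sketch: the midpoint assumption (every ride in a band treated as scoring at the centre of that band) is an assumption made here, and the real mean depends on individual scores that the thread does not give.

```python
# Ride counts per 10-point score band from the first post (lower bound -> rides).
ride_counts = {20: 1, 30: 28, 40: 2065, 50: 35946, 60: 62599, 70: 6464, 80: 126, 90: 3}

total = sum(ride_counts.values())

# The modal band is the band holding the most rides.
modal_band = max(ride_counts, key=ride_counts.get)

# Rides in bands below and above the modal band.
below_mode = sum(n for band, n in ride_counts.items() if band < modal_band)
above_mode = sum(n for band, n in ride_counts.items() if band > modal_band)

# ASSUMPTION: treat every ride as if it scored at the midpoint of its band.
grouped_mean = sum((band + 5) * n for band, n in ride_counts.items()) / total

print(f"modal band: {modal_band}-{modal_band + 9}.99%")
print(f"rides below modal band: {below_mode}, above: {above_mode}")
print(f"grouped mean estimate: {grouped_mean:.1f}%")
```

With these counts, about 38,000 rides fall below the 60-69.99% band and about 6,600 above it, and the midpoint estimate of the mean comes out near 62%, which is below the middle of the modal band.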
siegi b.
Apr. 9, 2008, 04:17 PM
... and as somebody who was actually trained in "work measurement" I can tell you that any type of measurement should only be applied to repetitive tasks/events. Many years ago the point was driven home that any kind of creative task/event, or one that requires more than a little thought cannot and should not be "measured".
Roan
Apr. 9, 2008, 04:23 PM
... and as somebody who was actually trained in "work measurement" I can tell you that any type of measurement should only be applied to repetitive tasks/events. Many years ago the point was driven home that any kind of creative task/event, or one that requires more than a little thought cannot and should not be "measured".
Agreed and I'm trained in SPC (Statistical Process Controls), JIT, Production Planning and Management and a buncha other stuff like that. Been a while, but I doubt the basic concepts have changed much.
Eileen
ShotenStar
Apr. 9, 2008, 04:55 PM
So Siegi and Eileen -- is there a better way to rank the relative performance of a group of horse/rider pairs performing the same test?
*star*
Roan
Apr. 9, 2008, 05:22 PM
So Siegi and Eileen -- is there a better way to rank the relative performance of a group of horse/rider pairs performing the same test?
*star*
Star,
I wasn't trying to kill your idea, really. It sounds like a great idea, but with people not being robots and each having individuality, it just isn't doable.
Dressage, IMO, is an ART and one person's Bob Ross might be another person's Edward Munch or Monet. Hrm, okay, that's kinda extreme, but the gist is the same ;)
As for what I would come up with -- I'm not educated enough in dressage to propose something. I think that anyone could come up with a ranking system, but one that actually *works* needs to come from peoples that have experience with upper level dressage AND measurement systems.
Just my 2 cents,
Eileen
J-Lu
Apr. 9, 2008, 08:31 PM
Personally, I think the range of 1-10 gives a sufficient spread between "totally awful" and "spectacularly correct" to be a useful metric scale. I don't think *the resolution* is good enough to separate a 6.5 from a 6 or a 7 because I think the greatest source of error in the measurement comes from the judges. If 50 judges watching the same video can reliably score movement x a 6 and movement y a 7, THEN we can start talking about awarding half-points. :lol: In principle, I think it is a good idea but I don't think half-points would provide accuracy over integers. I think we'd end up with a bunch of 5.5s and 6.5s and the total score histogram would look very similar to what you posted above. Just my thoughts, j.
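The resolution argument can be put in rough numbers. A minimal sketch, assuming (these are assumptions made for illustration, not measurements) that a recorded mark equals the true quality plus random judge error with some standard deviation, and that rounding to the nearest allowed mark adds an independent error spread uniformly over one scoring step, which contributes a variance of step squared divided by 12:

```python
import math


def rms_error(judge_sd, step):
    """Approximate RMS error of one recorded mark.

    ASSUMPTION: judge error and rounding error are independent, and
    rounding error is uniform over one step (variance = step**2 / 12).
    """
    return math.sqrt(judge_sd ** 2 + step ** 2 / 12)


# HYPOTHETICAL judge-to-judge spreads, in marks; not taken from any real data.
for judge_sd in (0.25, 0.5, 1.0):
    whole = rms_error(judge_sd, step=1.0)  # integer marks
    half = rms_error(judge_sd, step=0.5)   # half marks
    gain = 100 * (whole - half) / whole
    print(f"judge sd {judge_sd}: whole {whole:.2f}, half {half:.2f}, gain {gain:.0f}%")
```

Under these assumptions the benefit of half marks shrinks as judge disagreement grows: the error drops by about a quarter when judges differ by a quarter of a mark, but by only a few percent when they differ by a full mark.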
pluvinel
Apr. 9, 2008, 09:25 PM
Agreed and I'm trained in SPC (Statistical Process Controls), JIT, Production Planning and Management and a buncha other stuff like that. Been a while, but I doubt the basic concepts have changed much.
Eileen
Well I guess I should pipe in since I was the proponent of this idea. The concept of judging "human performance" in industrial applications is not new. As a matter of fact I have recently had to take testing myself to demonstrate that I can calibrate such a system appropriately.
Judges do the same thing...they sit and observe a human perform the same test over...and over...and over....and over again....talk to any judge who has sat for 8 hrs judging a Training Level class....no different than someone who has to judge any service operators.
Everyone in the dressage world bemoans how "subjective" dressage judging is, as if there is nothing that can be done. LOTS of stuff can be done to standardize judging.
There is a whole body of knowledge in the world of "industrial quality engineering" that addresses subjective judging of human performance. The USDF/USEF could be leading the equestrian world in judging dressage if it adopted those ideas.
For those of you who want to find out about the use of Kendall's Coefficient of Concordance along with the Kappa parameter to improve dressage judging, this article might shed some light on how judging can be improved. The example below shows how statistics can be used to evaluate the quality of customer service operators. The people being judged are customer service representatives and their evaluators are judging the quality of the call. No different than riders performing in front of judges judging various parameters of quality.
In an isolated room, three QA specialists: Judith, Malcolm, and Susan, listen to ten recorded conversations between actual customers and customer service operators. Each specialist grades the operators' responses based on characteristics such as friendliness, accuracy, and suitable advice. These grades are used to obtain a comprehensive score, which is used in the analysis
You can read the rest of the story by clicking on the link. It is written by one of the companies that provides statistical analysis packages to industry.
http://www.minitab.com/resources/Articles/Attribute%20Gage%20R&R%20KT%2035.pdf
I do this sort of work on a daily basis as a quality engineer for a Fortune 50 company.
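For readers unfamiliar with Kendall's coefficient of concordance (W), here is a minimal sketch of the calculation for the simplest case, complete rankings with no ties. The three judges' rankings are invented for illustration and are not from the Minitab article or from any real show, and the function name `kendalls_w` is just a label for this sketch. W is 1 when all judges rank the rides identically and moves toward 0 as the rankings become unrelated.

```python
def kendalls_w(rankings):
    """Kendall's coefficient of concordance for complete rankings without ties.

    rankings: one list per judge; each list gives the rank (1..n) that
    judge assigned to each of the same n rides, in the same ride order.
    """
    m = len(rankings)      # number of judges
    n = len(rankings[0])   # number of rides
    for r in rankings:
        if sorted(r) != list(range(1, n + 1)):
            raise ValueError("each ranking must be a permutation of 1..n")
    # Sum of the ranks each ride received across all judges.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((total - mean_sum) ** 2 for total in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))


# HYPOTHETICAL data: three judges ranking the same five rides (1 = best).
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 4, 5]
judge_c = [1, 3, 2, 5, 4]

print(f"W = {kendalls_w([judge_a, judge_b, judge_c]):.3f}")
```

Real dressage marks contain many tied values, so an actual analysis would need the tie-corrected form of W (or an agreement statistic such as kappa) rather than this no-ties version.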
Ambrey
Apr. 9, 2008, 10:56 PM
Personally, I think the range of 1-10 gives a sufficient spread between "totally awful" and "spectacularly correct" to be a useful metric scale.
Only if the judges are using the entire range. If they are really focusing on the areas between 5 and 7, with only outliers outside of that range, it changes things!
J-Lu
Apr. 9, 2008, 11:21 PM
Only if the judges are using the entire range. If they are really focusing on the areas between 5 and 7, with only outliers outside of that range, it changes things!
Exactly the problem. I can foresee that if half-points are added, judges will still only focus on the scale between 5 and 7, giving 5.5s, 6s and 6.5s, with the resulting score being similar to the score if only integers are used. The scores will continually close in on the mean...barring disaster for a specific movement. So instead of 5,6,6,6,7,5,6,6 we'll see 5.5,6,6.5,6.5,6.5,6.5,6,5.5,6. But still, there is the problem that for half-points to work, judges have to be very clear on what is a "6" and what is a "7" in order to describe a "6.5" without totally guessing.
This is why I think the scale is fine. I do have a problem with the judging, though. The judges, IMO, should use the whole range and score what they see. I have no sympathy for a judge giving any other score than what is earned. If they are worried about aggravating riders or not being invited back, then they shouldn't be judging. Period. I'd love it if in my job, I can "pad" data so as not to make people feel bad. But I'd be fired in an instant.
Ooops, but this isn't what the thread is about. :winkgrin: So to reiterate, I think the integer system is fine if, as pointed out, it is used correctly.
J.
Dressage Art
Apr. 9, 2008, 11:28 PM
1.926% (2,065 rides) scored in the 40-49.99% range
33.522% (35,946 rides) scored in the 50-59.99% range
58.377% (62,599 rides) scored in the 60-69.99% range
6.028% (6,464 rides) scored in the 70-79.99% range
The small number of very low and very high scores, with strong clustering in the mid-range, suggests that judges are not using the full range of scores
I agree that judges are not using the full range of scores, but I also know that it's quite difficult to do.
***Scores of 4-5-6-7-8 are the safe scores for a judge to give and nobody questions those scores, so the judge will not be put on the spot and asked to back up her/his scores. If the scores differ from other judges' scores, they will not differ by much, so again - it's safe to stay within a "safe range of scores"
***A judge has to have "balls" to use the full score range. Because it is so uncommon, it'll be a statement, and the judge must be ready for the limelight and ready to take some of the negative pressure that comes with breaking from the group and using the whole score range.
***If the majority of judges don't use the whole score range, then when a few judges decide to use the whole score range - they are not judging by the same standard.
***To use the whole range judge has to be very knowledgeable and experienced to be able to defend her/his scores.
~Freedom~
Apr. 9, 2008, 11:38 PM
I agree that judges are not using the full range of scores, but I also know that it's quite difficult to do.
The problem is that you simply cannot award scores to a competitor that doesn't deserve it.
I have always used the full range of scores but in a normal class the majority WILL sit at the 5--6 score.
I have given both a zero and a 10 but when I have, they were deserved.
claire
Apr. 10, 2008, 05:35 AM
The problem is that you simply cannot award scores to a competitor that doesn't deserve it.
But, here's the thing, ARE judges awarding the "deserved" scores?
This statement from another judge is very concerning:
Scores of 4-5-6-7-8 are the safe scores for a judge to give and nobody questions those scores, so the judge will not be put on the spot and asked to back up her/his scores. If the scores differ from other judges' scores, they will not differ by much, so again - it's safe to stay within a "safe range of scores"
If this is true, it would appear that the judges do NOT want to venture outside the "safe" range even if deserved?
Why?
~Freedom~
Apr. 10, 2008, 06:55 AM
If this is true, it would appear that the judges do NOT want to venture outside the "safe" range even if deserved?
Why?
Why? Because 90% of the time a competitor WILL be relatively consistent throughout their tests.
I have seen and judged enough tests that when a horse/rider works through their test they tend to remain within that quality level throughout. If the test shows scores of 9, 4, 6, 8, 3 then that rider is riding extremely uneven or the horse is.
While it happens, it simply isn't the norm. Most tests ARE relatively even throughout and an average rider on an average horse will get average scores of 5, 6, 5, 4, 6, 5 simply because they are not or cannot ride at a higher quality level. It is simply logistics that the majority of riders will fall in that score range.
It isn't the scores given that we should be concerned with but the attitude of the judge at the start of each test. That judge MUST be positive before that rider even sets foot inside the arena and be WILLING to give an 8 or 9 or even a 10. Too often I feel that judges see the horse before it enters and have already decided it is a 5ish horse/rider combination. That judge will rarely give a deserved 7 or 8 here or there when it is warranted during the following test.
Roan
Apr. 10, 2008, 06:58 AM
pluvinel,
I left the QA and statistics field well over ten years ago and, I really have neither the time nor inclination to pursue the study. From the brief glance I took at the study, it's number fudging.
*shrug*
IMO it's not really relevant nor will it work unless all judges are judging in the same manner and rewarding marks for the right things.
They aren't. If they were a lot of WBs would not be receiving the high marks they commonly get.
Another can of worms and I'm not going to open that one.
Anyhow, we have to treat the cause, not the symptoms.
Eileen
ShotenStar
Apr. 10, 2008, 07:46 AM
An excellent discussion so far .... and just a reminder: it IS THE DISCUSSION that is important .... not the outcome.
There are so many things about dressage that we all take for granted and never actually THINK about, like the scoring system and the possible impact of a performance standard. Talking through these issues and identifying the good and bad points will help all of us better understand the 'system and process' of showing dressage.
I agree that any type of 'performance' with a large population will always fall out in a Bell Curve; the mean and the median should be near the center of the scoring range -- maybe shifted slightly right to allow for the fact that only those who are doing well should be presenting themselves for evaluation. I think I would like to see a slightly flatter and broader curve than the ones produced by the current population / judging -- meaning that the judges are using more of the score range and that poor riders are either not arbitrarily excluded or inappropriately scored.
*star*
AM
Apr. 10, 2008, 09:52 AM
My usual volunteer job is scoring which I do for both dressage and eventing. In our area the same judges are working both the straight dressage and the dressage portion of eventing competitions. The judges do use the whole scale when they have the opportunity. I see this more in the eventing competitions because eventers still think the jumping phases are the most important and can make up for a few dressage mistakes. I even saw a 0 on one eventing test last fall.
I've also heard my instructor who is a judge express the same thoughts as Freedom. She says most of the rides she sees are in the 5-6-7 range - no one does anything so bad as to deserve a lower score nor so good as to deserve a higher score.
I see that the straight dressage riders are more consistent before they venture into competition because they are putting all their eggs into one basket. Eventers on the other hand will step into competition if they don't always get the correct canter lead because their horse jumps really well.
Ambrey
Apr. 10, 2008, 10:24 AM
The problem is that you simply cannot award scores to a competitor that doesn't deserve it.
I have always used the full range of scores but in a normal class the majority WILL sit at the 5--6 score.
I have given both a zero and a 10 but when I have, they were deserved.
So the issue is one of two things. Either the measurement system doesn't allow enough discrimination in the middle ranges, or you have a truly homogeneous population.
So if most people are riding a 6, either you could add more judging criteria or just face the fact that it's impossible for the human eye to discriminate past the current criteria.
I'm here simply as a fan of statistics and measurement, I really don't have the slightest clue about dressage judging. There ARE limits to human judging. Maybe we've just reached it?
eta: What I mean is, if we changed the scoring to provide a greater range for the riders who are doing a good enough job, rather than making quite a few of the scores "insufficient" type scores, could judges really adequately score it, or would we be putting too much pressure on human perception?
Dressage Art
Apr. 10, 2008, 01:22 PM
The problem is that you simply cannot award scores to a competitor that doesn't deserve it.
Yes, that's what it came down to ... our instructors were VERY encouraging for us to use the full range in the classroom, but when we went out to a show, our instructors kept repeating the same thing:
QUOTE~Freedom: "you simply cannot award scores to a competitor that doesn't deserve it" for the sake of using the whole score range.
It is a VERY fine line.
It isn't the scores given that we should be concerned with but the attitude of the judge at the start of each test. That judge MUST be positive before that rider even sets foot inside the arena and be WILLING to give an 8 or 9 or even a 10. Too often I feel that judges see the horse before it enters and have already decided it is a 5ish horse/rider combination. That judge will rarely give a deserved 7 or 8 here or there when it is warranted during the following test.
True, the attitude that a judge has even before the test and their openness to using the whole score range is most important!
Most people don't even imagine how fast the scores are coming at you: an average test has 20 movements in 5 minutes - the judge has to spit out meaningful comments with correct scores in seconds + think of collective marks at the same time (certain scores are direct indicators of collective marks: for example, transitions relate to the “submission” score and Turn on the Haunches relates to the “rider” score)! And that keeps on going for the whole day! There is no time to hesitate! It's very easy to become a creature of habit and start judging on autopilot.
I sometimes wonder how judges who have judged for 20 years feel during their show day? Do they still have to think during the test or do they just draw their scores and comments from their vast experience and don’t bother to question the situation?
Dressage Art
Apr. 10, 2008, 01:37 PM
I wrote reports last year for the local GMO Chapter Newsletter about my experience from the USDF "L" program. There is a part about “decimal thinking”, “whole-number” judges and “manipulating” the scores:
SESSION A #1 of USDF "L" Dressage Judges Program (from dressageart.com)
1. On the first day, international judge and GP rider and trainer Jeff Moore gave us a lecture about dressage judging and dressage biomechanics. Dressage in the US is fairly new: USDF was formed only 33 years ago, and the USDF “L” Judging program was formed 15 years ago to create better US dressage judges. Dressage judging in the US is constantly evolving and changing for the better. US judges continue to debate what is best for the future of US dressage.
One of those hot topics is “decimal thinking” in judging. Most of the performances will fall into the range of “5” to “7” scores. A score of “6” can be a “strong 6” or a “weak 6” (say 6.1 versus 6.9 – a spread of almost a full mark, or close to 10% of the final score), but judges have no way of showing this distinction to the rider. If one “whole-number” judge gives 5 for all of the movements that were 5.5 and another “whole-number” judge gives 6 for all of the movements that were 5.5 – those judges will end up with final scores of 50% and 60% - a 10-point difference for the same ride. Current judges are forced to come up with different systems of “manipulating” the scores to be able to show the decimal thinking to riders and even up the final scores. For example, if one pirouette was a 5.8 and another was a 6.2, instead of giving a 6 for both, the judge can wait to see both pirouettes and give a score of 5 for the first one and 7 for the second one to show the difference to the rider, but the scores will even themselves out to 6 in the end. However, obviously it’s not the perfect way of judging and helping riders, so dressage is in need of a more sophisticated judging system, a “decimal judging system.”
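The arithmetic behind the 50% versus 60% example can be sketched in a few lines of Python. The 20-movement test and the equal weighting of every movement are simplifying assumptions made here; real tests apply coefficients to some movements and add collective marks.

```python
def final_percent(marks):
    """Final score as a percentage of the maximum (10 per movement).

    ASSUMPTION: every movement has coefficient 1; real tests weight some
    movements more heavily and add collective marks.
    """
    return 100.0 * sum(marks) / (10 * len(marks))


# HYPOTHETICAL ride: 20 movements, each "really" worth 5.5.
true_marks = [5.5] * 20

judge_rounds_down = [5] * 20  # whole-number judge who gives 5 every time
judge_rounds_up = [6] * 20    # whole-number judge who gives 6 every time

print(final_percent(true_marks))         # 55.0
print(final_percent(judge_rounds_down))  # 50.0
print(final_percent(judge_rounds_up))    # 60.0

# The pirouette workaround: 5.8 and 6.2 recorded as 5 and 7 instead of 6 and 6.
print(sum([5, 7]) == sum([6, 6]))        # True: the total is the same either way
```

Both whole-number judges end up five points away from the 55% the ride "deserved", in opposite directions, which is where the ten-point gap comes from.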
ideayoda
Apr. 10, 2008, 01:50 PM
For me it is very important that the judge make a statement to the rider about what is particularly good within their ride, thus to use the range to denote correct TRAINING/balance/etc within a ride. Highlight the 'yes, do THAT again'. We must look for what is to be rewarded and simultaneously not give free passes for poor work.
This imho is esp true at higher levels. It is one thing to let a more mediocre seat slide at intro/training/even first. But when riders sustain their seats courtesy of a torqued curb bit, that totally ignores the plight of the horse as well as the sustaining of the quality of training.
Ambrey
Apr. 10, 2008, 02:36 PM
I wrote reports last year for the local GMO Chapter Newsletter about my experience from the USDF "L" program. There is a part about “decimal thinking”, “whole-number” judges and “manipulating” the scores:
But it is not the decimals that are the issue here, but the values placed on those decimals.
If it can either be a strong 6 or a weak 6, and judges can tell the difference between those, then broaden the scoring points above and below it.
When we want to score psychological tests, we normalize the scores so that we have a greater discrimination in that high frequency area. This isn't possible with dressage, but if the range of movements being scored with 6 is too high, then shift some of those up and some down within the scoring guidelines (not as individual judges).
Then, instead of that drop dead wonderful olympic ride being scored in the 80s, it would be scored in the 90s, the great pros would score in the 80s, and the vast majority of people could have the entire range from 50 to 80 to judge improvement or compare rides to others, rather than the current range from 55 to 65.
I'm sure there's some reason this just wouldn't work- but it's what I'd do if I was coming up with some sort of scoring system and ran into this issue :)
Whisper
Apr. 10, 2008, 02:56 PM
Ideayoda, I know that my instructor really focuses a lot on how my riding influences the horse - ie. I half-halted at the wrong instant, or leaned a little too far forward and dumped the horse on the forehand. Can judges try to connect the horse's faults back to the rider a bit or express it the other way around more often to make it more obvious to the riders? I know that some instructors only seem to focus on what the horse is doing, so they might not be aware until they hear it from someone else.
I think that half-marks might be helpful, but they're more likely to determine placings between horses/riders who are very close than to expand the range used. I don't think it will change the general score range or distribution that much, though. I've had a couple of 8's and a 3 in the same test (I forgot to transition to canter until the last second, so lost points), and I've seen scores ranging from 13 to 54 penalty points at Intro (equivalent to 87%-46%) in one class, so at least *some* judges are using the full range.
I've heard that some judges don't feel comfortable giving low scores, especially for rider, because they believe that riders will complain. I think that is doing the riders a disservice, since if they aren't aware that there is a problem, how can they fix it? I'm not sure about the higher scores - I haven't heard any theories advanced as to why judges would be afraid to give *high* scores that are deserved. I'm curious what it *would* take to get a score in the 90's - would a top-level rider and GP horse entering TL or 1st level get that score from most judges, or is that kind of score unattainable for anyone? I have to assume that if they give a verbal "that's great" along with a score of 7, it's along the lines of a riding instructor saying the same thing to a student for a relatively incremental improvement "hey, that's more along the lines of what you're shooting for," rather than as an objective "that's the best I've seen" or even "that's meeting what the standard *should* be."
Vaulting also uses a 1-10 scale for each movement (or section of the freestyle), but with one allocated to the horse rather than to the rider. In the low-level classes, most people score between 3 and 5 to start with, but 0, 1, and 2 are common. Beginners who are especially physically adept tend to get 6's and 7's, and usually quickly move up a level. Throughout the year, usually the average scores move up as the people improve. Once they move up to the next level, the scores might go down a bit, but usually aren't below 4 unless someone loses their balance or something. At the upper levels, almost everyone scores 8's and 9's. I've never heard anyone claim that the lower scores for beginners discourage anyone - it seems to be generally accepted that once we do better, we'll get better scores. Our scoring controversy right now is that most judges want dressage style gaits with loads of suspension and the horse really using itself. That usually makes vaulting harder rather than easier (better energy for a few of the moments, but harder to stand up there), and some horses even get an 8 in spite of spooking or otherwise making a major mistake, whereas a nice steady-eddy who makes it easy for the people on his back just gets a generic 6.
At our first show of the season, I was near the bottom of the class in my compulsories and around the middle in my freestyle. The winners in both classes scored around 70%, while everyone else was in the 50%'s, all clumped together with less than 5% between them. The winners were people who moved up at the end of last season, so I wasn't surprised or upset, and no one else seemed to be either.
ShotenStar
Apr. 10, 2008, 03:24 PM
So the issue is one of two things. Either the measurement system doesn't allow enough discrimination in the middle ranges, or you have a truly homogeneous population....
I don't see it as "Either/ Or" .... I see it as the interplay of the two: the measurement system does not have enough range to allow for finer gradations of performance AND the performances presented in licenced shows tend to be homogeneous.
Now, we can debate if the homogeneity is trending good or trending bad .... the DC's drive to implement a Performance Standard suggests they think it is trending bad .... but even if they stop the low scoring riders from moving up the levels, they will still be confronted with a homogeneous population and a poor tool for ranking them.
*star*
Eclectic Horseman
Apr. 10, 2008, 03:44 PM
This discussion brings to mind the book, "Blink: The Power of Thinking without Thinking" by Malcolm Gladwell.
Very accurate assessments can be made by the human brain on a minimal amount of data. Sometimes too much information or overthinking can blur an otherwise correct impression.
The down side of making judgments based on limited factors is that bias based on stereotypes is the inevitable result.
I think that dressage judging shows textbook examples of both phenomena, and that is why reliability in any type of scoring system is unavoidably limited.
claire
Apr. 10, 2008, 05:13 PM
Most tests ARE relatively even throughout and an average rider on an average horse will get average scores of 5, 6, 5, 4, 6, 5 simply because they are not or cannot ride at a higher quality level. It is simply logistics that the majority of riders will fall in that score range.
Absolutely, tests will fall in a bell curve with the majority of scores in the 5-6-7 range.
But, going back to the DC problem statement of so much "bad riding"...and the (initial) qualification scores of "60 +" to move up.
According to the data analysis, less than 3% of the rides were scored 40% and below (insufficient-very bad).
I am seeing what appears to be a communication and/or measurement problem.
Because, either there isn't a huge problem with "bad riding" (in which case, what IS the problem?)
Or the "bad riding" is being scored 50-60% range (sufficient-fairly good)
(in which case, WHY won't the judges score the bad riding as such?)
Or, the definition/meanings assigned to the scores are not accurate/clear.
(ie. a 5 is NOT seen as sufficient but as fairly bad)
And THEN, there are the comments that judges do not like to score outside of the 5/6/7 "safe range" for fear of: being questioned/not being invited back/or being "sued".
But that is a whole other ball of wax...:confused:
Dressage Art
Apr. 10, 2008, 05:24 PM
Or the "bad riding" is being scored 50-60% range (sufficient-fairly good)
(in which case, WHY won't the judges score the bad riding as such?)
5 - is "marginal"
6 - is "satisfactory"
So that would be "marginal - satisfactory" range, not "(sufficient-fairly good)" It's quite difficult to score below 50% or to judge with a final score of below 50%, b/c there are about 20 scores in the test and some scores will rebalance scores of 4 and lower. Judge can give a 4 and a 7, but at the end that will come out the same as giving 5 and 6. It’s easy to score a 5 or 6, but it’s difficult to score 4 or 7.
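To put rough numbers on the rebalancing point, here is a minimal Python sketch. It assumes a hypothetical test of 20 movements with equal weight (no coefficients); the marks are invented for illustration and `percent_score` is just a helper name, not anything from the rule book.

```python
def percent_score(marks):
    """Final percentage for a test where every movement has equal weight."""
    return 100.0 * sum(marks) / (10 * len(marks))

# Hypothetical sheets of 20 marks each: one uses 4s and 7s, the other 5s and 6s.
wide = [4, 7] * 10
narrow = [5, 6] * 10

print(percent_score(wide))    # 55.0
print(percent_score(narrow))  # 55.0
```

Both sheets come out at the same final percentage, so the final score alone cannot show whether the judge used the wider range.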
33.522% (35,946 rides) scored in the 50-59.99% range
58.377% (62,599 rides) scored in the 60-69.99% range
Some of the 33% of riders who score in the 50s ALL the time and keep on progressing up the levels while scoring in the 50s will be addressed by the new standards rule. "Marginal" riding is NOT an indication that horse/rider are ready to move up to the next level. "Marginal" is bad enough and should indicate to riders that they need to do some more homework before moving up to the next level with scores in the 50s.
J-Lu
Apr. 10, 2008, 06:30 PM
Dressage Art,
According to the 2008 USEF dressage rule book, a 5 is "sufficient" not "marginal". The scores in question were earned when a 5 = "sufficient", so we're talking "sufficient to fairly good". Equating it to "marginal" at this point is erroneous.
http://www.usef.org/documents/ruleBook/2008/08-DR.pdf
The correct interpretation of that data is that 33% of ALL RIDES in that time frame scored in the 50s. This says nothing about the frequency with which people who scored in the 50s continue to score in the 50s. It looks to me that you are wrongly stating that this means that 33% of all rides are from riders who always score in the 50s. You might want to review the entire report to get a good understanding of what was found (they did follow riders as well).
Dressage Art
Apr. 10, 2008, 06:48 PM
The scores in question were earned when a 5 = "sufficient", so we're talking "sufficient to fairly good".
“Fairly good” is a score of 7. If you are talking about scores of 5 and 6, the description "fairly good" doesn't belong there.
"Sufficient" is changing to "marginal" for the obvious reason that some people think that a score of 5 is a quality, respectable score. The new vocabulary will take care of that as well.
claire
Apr. 10, 2008, 06:49 PM
I think ShotenStar best expressed my (initial) point that we are dealing with several different components of measurement/scoring:
In the case of dressage tests, the 'components' are:
1. a component due to the characteristics of the part or item being measured -- the performance of the horse / rider being assessed in the ring at that moment. In the case of dressage, there are actually three measurements taken during a test: the scores for the movements; the collective marks for the horse; and the collective mark for the rider.
2. a component due to the reliability of the gage -- the scores / scoring system used in dressage tests -- the numbers 0-10 and the meanings assigned to them.
3. a component due to the characteristics of the operator (user) of the gage -- the judge and the guidance given to judges through the rules and definitions of movements / criteria for scoring.
All three of these points need to be considered when evaluating a measurement system. Right now, the discussion is focused mainly on # 1 - the performance of the horse and rider and the need to impose limits or restrictions on who may show at what levels (the Performance Standards Rule change issue). There is a certain amount of 'blame' floating around in the ether, suggesting that riders are so bad that they need to be regulated in order to prevent 'abuse' to the horse.
There has also been some focus on # 3, the judges as users of the gage. Rebecca's show report sparked some discussion on the question of how well judges are scoring rides when poor rides end up with scores in the 60's, and the USDF's own data shows that out of 107,232 rides in the 2007 show year:
.001% (1 ride) scored in the 20-29.99% range
.026% (28 rides) scored in the 30-39.99% range
1.926% (2,065 rides) scored in the 40-49.99% range
33.522% (35,946 rides) scored in the 50-59.99% range
58.377% (62,599 rides) scored in the 60-69.99% range
6.028% (6,464 rides) scored in the 70-79.99% range
.118% (126 rides) scored in the 80-89.99% range
.003% (3 rides) scored in the 90-100% range
The few numbers of very low and very high scores, with a strong clustering in the mid-range suggests that judges are not using the full range of scores, and therefore not conveying to riders the information they need on the state of their riding and training. This has been extensively discussed in the related threads.
There is much less attention being given to # 2, the reliability of the gage.
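For anyone who wants to check the arithmetic on the ride counts quoted above, here is a small Python sketch. It assumes each band spans ten percentage points (20-29.99, 30-39.99 and so on) and that the count for the 50s is 35,946; the dictionary keys are my own labels, not USDF's.

```python
# Ride counts per score band, as quoted in the post (2007 show year).
counts = {
    "20-29.99": 1,
    "30-39.99": 28,
    "40-49.99": 2065,
    "50-59.99": 35946,
    "60-69.99": 62599,
    "70-79.99": 6464,
    "80-89.99": 126,
    "90-100": 3,
}

total = sum(counts.values())  # 107232, matching the quoted total

# Share of rides in each band, in percent.
shares = {band: 100.0 * n / total for band, n in counts.items()}

# Share of rides below 50% and share in the 50-69.99% cluster.
below_50 = shares["20-29.99"] + shares["30-39.99"] + shares["40-49.99"]
mid_band = shares["50-59.99"] + shares["60-69.99"]

for band, pct in shares.items():
    print(f"{band:>9}: {pct:7.3f}%")
print(f"below 50%: {below_50:.2f}%   50-69.99%: {mid_band:.2f}%")
```

The counts do add up to the quoted 107,232 rides, with just under 2% of rides below 50% and roughly 92% of rides in the 50s and 60s.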
J-Lu
Apr. 10, 2008, 08:08 PM
“Fairly good” is a score of 7. If you are talking about scores of 5 and 6, the description "fairly good" doesn't belong there.
"Sufficient" is changing to "marginal" for the obvious reason that some people think that a score of 5 is a quality, respectable score. The new vocabulary will take care of that as well.
Sorry, I was referring back to your previous post (and someone else's post).
"Sufficient" has not yet changed to "marginal", so using the term "marginal" is incorrect. Currently, "sufficient" means that the basic criteria are fulfilled. Thus, if one currently scores in the 50s, the judge has deemed that the basic criteria for the test have been met.
How people value the score of 5 and whether or not you agree is irrelevant. Currently, a score of 5 means that the basic criteria are fulfilled, and only the judge of the given ride can say why a 5 was awarded.
pluvinel
Apr. 10, 2008, 09:20 PM
pluvinel,
I left the QA and statistics field well over ten years ago and, I really have neither the time nor inclination to pursue the study. From the brief glance I took at the study, it's number fudging.
*shrug*
IMO it's not really relevant nor will it work unless all judges are judging in the same manner and rewarding marks for the right things.
They aren't. If they were a lot of WBs would not be receiving the high marks they commonly get.
Another can of worms and I'm not going to open that one.
Anyhow, we have to treat the cause, not the symptoms.
Eileen
We are in violent agreement.
The article I quoted before describes the procedures to test the degree of agreement (concordance) between appraisers (judges). The article describes how the Kappa statistic and Kendall's coefficient of concordance are computed to calibrate a judging system. Perhaps it was too geek-speak.
>>> The Kappa statistic is used to test the level of absolute agreement among appraisers. Kappa of 1 indicates perfect agreement.
>>> Kendall's coefficient of concordance is an index of the degree of agreement between several variables measured by various raters....eg., judges.
My point is that there is technology available to allow one to quantify the level of "agreement" (or disagreement) that people say happens now in dressage judging.
The body of statistical knowledge to measure human performance is robust, used by many disciplines and is not considered "number fudging".
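As a rough illustration of what these two statistics look like in practice, here is a minimal Python sketch. It is not the procedure from the article: it uses Cohen's kappa (the two-judge version of Kappa) and Kendall's W without a correction for tied ranks, and every mark and ranking in it is invented. The function names are my own.

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two judges who marked the same movements."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    # Agreement expected by chance, from each judge's own mark frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

def kendalls_w(rankings):
    """Kendall's W for m judges who each rank the same n rides (no ties)."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]
    mean_total = m * (n + 1) / 2
    s = sum((t - mean_total) ** 2 for t in totals)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Hypothetical marks from two judges for the same eight movements.
judge_a = [6, 6, 7, 5, 6, 7, 5, 6]
judge_b = [6, 5, 7, 5, 6, 6, 5, 7]
print(round(cohens_kappa(judge_a, judge_b), 2))  # about 0.43

# Hypothetical placings of four rides by three judges (1 = best).
placings = [[1, 2, 3, 4], [2, 1, 3, 4], [1, 3, 2, 4]]
print(round(kendalls_w(placings), 2))  # about 0.78
```

A value of 1 means the judges agree completely; values near 0 mean the agreement is no better than chance (kappa) or that the rankings are unrelated (W).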
There are many references on the use of these techniques. Judging human performance is a well-developed body of knowledge that is used by psychologists, sociologists and industrial engineering practitioners. The USDF/USEF need not re-invent the wheel....it need not rediscover that the earth is round. The knowledge is there for the taking.
Here is a shorter explanation on concordance or inter-rater reliability from wikipedia.
Concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
http://en.wikipedia.org/wiki/Inter-rater_reliability
Some references on the subject....this really is a well studied topic....
Gwet, K. (2001) Handbook of Inter-Rater Reliability
Shrout, P. and Fleiss, J. L. (1979) "Intraclass correlation: uses in assessing rater reliability" in Psychological Bulletin. Vol. 86, No. 2, pp. 420--428
Fleiss, J. L. (1971) "Measuring nominal scale agreement among many raters" in Psychological Bulletin. Vol. 76, No. 5, pp. 378--382
claire
Apr. 10, 2008, 10:25 PM
Quote:
Concordance is the degree of agreement among raters. It gives a score of how much homogeneity, or consensus, there is in the ratings given by judges. It is useful in refining the tools given to human judges, for example by determining if a particular scale is appropriate for measuring a particular variable. If various raters do not agree, either the scale is defective or the raters need to be re-trained.
pluvinel, Interesting article.
However, it occurs to me that objective measurement/diagnosis and solutions are only as good as an upfront objective statement of the perceived "problem".
What if the actual "problem" isn't about raising the standard of riding/preventing abuse of the horse?
But the solution (qualification standards) IS a solution to the actual "problem"?
eg. How to increase entries to USEF shows?
or
How to appear to the international PTB that while we don't have qualifications for CDI's, we do have qualification standards by golly! ;)
I mean why all the behind the door secrecy, lowering qualifying scores, with-holding of scores and changing the definitions of scores?
Funny, after the data came out showing that 97% of the rides scored 50%+ = (Sufficient +) It is decided to change the meaning of "5" from sufficient to marginal :rolleyes:
Dressage Art
Apr. 10, 2008, 10:59 PM
Funny, after the data came out showing that 97% of the rides scored 50%+ = (Sufficient +) It is decided to change the meaning of "5" from sufficient to marginal :rolleyes:
Re "marginal" - as far as I know, this change has been in the works already for 2+ years. Better stick to data and not assume ;)
Dressage Art
Apr. 10, 2008, 11:05 PM
But the solution (qualification standards) IS a solution to the actual "problem"?
eg. How to increase entries to USEF shows?
I have to apply 2 months in advance to get in to the local USEF/USDF dressage shows. They ALL fill up within 2 weeks of the opening date. USDF championship qualifying scores were raised b/c there were too many entries for the actual Championships and classes were getting way too large. I fail to see how that data shows that USEF needs "to increase entries to USEF shows?"
The Data is getting twisted!
claire
Apr. 11, 2008, 06:39 AM
The Data is getting twisted!
5 - is "marginal"
Well, the data isn't twisted.
But, something is being twisted for sure! :winkgrin:
Hard to have an objective discussion when the measurement components keep changing.
ShotenStar
Apr. 11, 2008, 07:36 AM
...The Data is getting twisted!
The data is well-behaved.
The DISCUSSIONS are wandering off on many tracks, with many theories and suppositions, because there is NO COMMUNICATION from TPTB -- the Dressage Committee, the EF and DF.
Without their input on the Whys and Wherefores of their thinking, those of us with an interest in the issue are on our own. Yes, yes, they are 'working on it' .... and need time to come up with the revised proposal. It would be nice if they could communicate and work at the same time.
*star*
ToN Farm
Apr. 11, 2008, 07:37 AM
It's quite difficult to score below 50% or to judge with a final score of below 50%, b/c there are about 20 scores in the test and some scores will rebalance scores of 4 and lower. Judge can give a 4 and a 7, but at the end that will come out the same as giving 5 and 6.
This has always bothered me. There are 'key' movements in every test, yet a rider can score low on these important movements and make up for it by scoring high on the easier movements. For example, you can have a great walk (coefficient) and nail your halts (sometimes 3 of them), and get a couple good transitions. This allows you to miss your flying changes, pirouettes, etc. and still come up with a 60% score. I'm not saying that all movements aren't important, but I do think that some are more indicative of whether a pair is ready for that level of work.
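Here is a small Python sketch of how that compensation can play out once coefficients are involved. The sheet is entirely made up (it is not a real test, and real coefficients differ by test and level), and `weighted_percent` is just an illustrative helper.

```python
def weighted_percent(movements):
    """movements: list of (mark, coefficient) pairs; returns the percentage score."""
    earned = sum(mark * coef for mark, coef in movements)
    possible = sum(10 * coef for _, coef in movements)
    return 100.0 * earned / possible

# Hypothetical sheet: strong halts and walk, missed changes and pirouettes.
test_sheet = (
    [(8, 1)] * 3    # three halts
    + [(8, 2)]      # walk, assumed coefficient of 2
    + [(7, 1)] * 2  # two good transitions
    + [(4, 1)] * 4  # missed flying changes and pirouettes
    + [(6, 1)] * 8  # everything else
)

score = weighted_percent(test_sheet)
print(f"{score:.1f}%")  # just over 62%
```

With these invented marks, four 4s on the 'key' movements still leave the ride above 60%.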
Dressage Art
Apr. 11, 2008, 06:59 PM
This has always bothered me.
Ideally, that should be addressed in the collective scores. However, IMHO collective marks are too general and have too much stuff packed in to them, so AGAIN some things are compensating for others and as many of us see most of the collective marks are 6 or 7 = pointless!
I think it'll be great if US can add one more collective score that reflects the CORE of the level:
Training Level: freedom of the gaits = score
First Level: pushing power = score
Second Level: carrying power = score
and so on...
That would underline the importance of the BIG picture.
As for walk, I think it's the most important dressage gait and deserves even more attention and scores.
As for the halts, there are 2 of them at EACH level - so better learn how to rack up the points from them ;) it's the only freebie that dressage tests give you.
pluvinel
Apr. 12, 2008, 08:05 AM
Ideally, that should be addressed in the collective scores. However, IMHO collective marks are too general and have too much stuff packed in to them, so AGAIN some things are compensating for others and as many of us see most of the collective marks are 6 or 7 = pointless!
...................
This is the reason for challenging the "robustness" of the scoring system and how it is used. Based on my attendance at both L-Judges seminars and the old AHSA judges' licensing seminars (at Gladstone), these "quality" questions arise:
1-No "bad rides" that would merit scores of 0-1-2-3 were used to illustrate how to use that scale of bad (fairly bad, bad and very bad plus "not executed") at L-judge's training.
2-No "excellent rides" were shown to illustrate what a "10" ride looks like at L-judge's training. There were debates on scoring between 7-8-9, but nothing was shown to clearly illustrate the "pinnacle" of what is looked for in the movement.
3-The one example of a "10" that was used was for the collective marks of gait. Florencio's canter was shown to illustrate this. Yet it was a clearly 4-beat canter.
4-At the AHSA judges seminars, judges had to judge the rider's movements by holding up cards so one could clearly see the variability in opinions. It is very similar to the current L-judges format, just with licensed judges. The instructor judges would question people giving high or low marks and pick on people who only gave out 6's.....it was very interesting to see and hear justifications for the scores. It was an excellent discussion for a rider trying to learn what judges look for. In one case that stands out in my memory, one judge held up a 7 and another judge held up a 4 for the same movement. Those people are still active today.
For a quality engineer this raises questions about:
1-The standard itself (versus what is stated in the rules), since the rules state that the canter is 3-beat, but the illustration for evaluating the canter is a 4-beat example. An important horse received a 10 for a gait that is in conflict with the rulebook.
2-The training of evaluators on how to use the standard, since the upper and lower end of the scales were not taught in the training.
3-The evaluators' (licensed judges) use of the evaluation scale as demonstrated by the variability in the scores for the same movement. The judges that gave out those marks are still judging.
Since I had the privilege and luck to attend that one judges licensing seminar, auditors have not been allowed in to these sessions by USEF.
slc2
Apr. 12, 2008, 01:14 PM
I don't think dressage judging is all that subjective, I think it can and has been judged objectively, in the sense that if you do X you get a certain score and if you don't you get a certain score.
I feel it's very obviously defined what they are looking for - books like 'Priority Points' (Dane Rawlings) and Crossley's book on dressage basics make it very clear where the points come from and where they go away.
I feel the judges do by and large use the scale of 1-10 correctly. Linda Zang came out with an excellent book where she went into this in great detail and she used a lot of her international experience judging and provided us with a lot of information. Impressions of her book?
I think that the complaints of not using the scale correctly come out because:
- people see OTHER riders scoring what they feel is too high - especially with the more prominent top riders.
- people want their OWN scores to be higher
That's where I think people conclude judges don't use the range. In fact they very much do. People DO get 0, 1, 2, 3, 4, 5. And yes, they very rarely get a 10 or 9, and I think that's fair.
I think looking at the top 3-4 placers in an international competition gives one a very distorted look at judging, where most of the riders consistently do get 6,7,8.
I also think that a lot of people have pet peeves and see that trait and nothing else, and they want a judge to score a horse really low if, for example, it does xyz during the test. I also think they look at the top riders with a kind of magnifying glass, in a very distorted way, expecting impossible things, magnifying faults in those they dislike and ignoring faults in ones they like.
I think they get mad when they see the rider doesn't score as low as they think they should. I've heard people say 'i want the horse to get a 1 (or a 2 or 3) in every single movement where he does xyz', and in fact, NO ONE judges that way and they never have. Only certain things get really low scores and that is defined very clearly.
I think the argument is more appropriately worded, 'do we punish riders enough in how they are scored'.
From what I have seen, both in my own scores and in other people's over many years, most judges actually judge very well and extremely consistently, so that riders get the same scores over and over and over, unless they do the movement a different way, and then they get an (obviously) different score.
When I watch most rides, I can very easily tell why the person got a lower score than the previous day, say, or why they didn't get as high a score as they wanted. I find most judges are extremely consistent, and that from one judge to another, even, the judging is very consistent, even down to getting the same marks over and over for the same movements. The amount of variation that is present, I think one can deal with...and no, I don't think there are better alternatives, including half points.