Judging the judges – how statistical analysis evaluates fairness and accuracy in gymnastics scoring

Dr Hugues Mercier speaks exclusively to Olympics.com about his 10-year project that evaluates judges' performance in disciplines such as artistic and rhythmic gymnastics.

7 min | By Jo Gunston
Judges at the artistic gymnastics competition at Tokyo 2020
(Photo by Patrick Smith/Getty Images)

"In rhythmic gymnastics, they always have to cut the air conditioning for the ribbons because you don't want the ribbons to move with the draft of the air conditioning," reveals Dr Hugues Mercier, the man tasked with providing statistical tools for the International Gymnastics Federation (FIG), which help evaluate judges' accuracy and fairness.

"So, in the arena (at the World Championships in Valencia) it was around 35°C the second day (for the qualification and final for clubs and ribbon). It was so hot, and the day was so long," said the Canadian in an exclusive interview with Olympics.com while at the 2023 World Gymnastics Championships in Antwerp. "It's a very hard task for the judges who have to be seated almost for 10, 12 hours."

And therein lies one of the myriad factors that can affect judging performance, which is being researched and analysed by Mercier's company, Maelstrom Analytics and Technologies, in a project started a decade ago.

Mercier does not act on these evaluations, or the patterns they reveal, himself; they are passed on to the FIG to implement or research further in pursuit of the fairest system possible.

So, the project is ongoing with accuracy and fairness in judged sports such as artistic gymnastics and rhythmic gymnastics at its heart. But how does it work?


Hugues Mercier of Maelstrom Analytics and Technologies, a company that analyses judging accuracy and fairness in gymnastics scores

(Hugues Mercier)

Judgement call

First up, practical matters. What do judges actually do?

Taking artistic gymnastics as an example, each judge marks a routine based on the FIG's Code of Points (COP), the rulebook that defines the scoring system.

Under the COP, which is re-evaluated each Olympic cycle, every routine receives a D-score (difficulty) and an E-score (execution).

All gymnasts begin with an execution score of 10, from which points are deducted for faults such as bent legs, bent arms, and falls.

The difficulty score is the sum of the values of the hardest moves in the routine, each of which is assigned a value in the COP – except on vault, where each vault has a single nominated value.

The D-score, therefore, is clear-cut, with a separate panel of judges evaluating the skills; the execution score is more of a challenge.
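To make the arithmetic concrete, here is a minimal sketch of how a single routine's total comes together – a D-score plus an E-score that starts at 10 and loses points for faults. The deduction values and routine below are invented for illustration, not taken from the Code of Points.

```python
# Minimal sketch: total score = D-score (difficulty) + E-score (execution),
# where the E-score starts at 10 and loses points for each fault.
# The deduction values below are invented, not from the FIG Code of Points.

def execution_score(deductions: list[float]) -> float:
    """E-score: start from 10.0 and subtract every deduction, floored at 0."""
    return max(0.0, 10.0 - sum(deductions))

def total_score(d_score: float, deductions: list[float]) -> float:
    """Overall mark: difficulty plus execution."""
    return d_score + execution_score(deductions)

# Hypothetical routine: D-score of 5.8, two small form breaks and a fall.
print(round(total_score(5.8, [0.1, 0.3, 1.0]), 3))  # 5.8 + 8.6 = 14.4
```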

How does judging work?

The set-up of the judging panel is the first port of call in achieving fairness when scoring gymnasts.

At the 2023 World Championships, for example – the largest artistic gymnastics competition in the world – there were seven execution judges per apparatus.

The top two and bottom two scores were removed, and the average of the middle three gave the resulting execution mark, which was then added to the difficulty score for the overall total.
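As a minimal sketch of that trimmed-mean rule – with the seven marks invented purely for illustration – the aggregation might look like this:

```python
# Sketch of the trimmed-mean aggregation described above for a seven-judge
# execution panel: drop the two highest and two lowest marks, average the
# middle three, then add the difficulty score. Marks are invented examples.

def panel_execution_score(marks: list[float], trim: int = 2) -> float:
    """Average the marks after removing the `trim` highest and lowest."""
    if len(marks) <= 2 * trim:
        raise ValueError("Not enough judges to trim that many marks")
    kept = sorted(marks)[trim:-trim]
    return sum(kept) / len(kept)

judge_marks = [8.4, 8.5, 8.5, 8.6, 8.6, 8.8, 8.1]  # seven E-scores
d_score = 5.8
print(round(d_score + panel_execution_score(judge_marks), 3))  # 14.333
```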

"What you have to understand is that judging execution is very, very hard," says Mercier. "It's very fast and the best way to ensure fairness for the gymnasts is to assume that, since it's a very difficult task, it's completely normal for judges to make a small error once in a while.

"But if we have a large panel of judges, say seven judges, one might have missed a deduction, another might have missed another one but in the aggregate, the panel provides a very good approximation of the true performance quality."

Much scientific research has also gone into making sure the judges mark the performance on their own merits, not influenced by others. To this end, panels are set up between the judges' tables, like blinkers on a horse, to deter communication.

"In order for the system to be efficient, I need to have seven different points of view from seven excellent judges, and it's important that we have what they think they observe – live – not after discussing with other judges because that induces all kinds of biases."

A numbers game

Citing the way his platform manages to measure the two axes on which the judges themselves are judged – fairness and accuracy – he says: "There are a lot of statistical tools that can be used.

"It is a bit tricky because judging is what we call in science 'a random process'... So, when I judge, I make errors, right? I'm going to miss a deduction, I'm going to forget something, I'm going to miss a movement, and so on. And the only way to ensure that our analysis truly reveals the true skills of the judges, whether fairness or accuracy, is to track them over the long term."

Hence the ongoing project that is reaping more and more useful information.
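As a rough, hypothetical illustration of why long-term tracking matters: a judge's deviation from the panel on any single routine is noisy, but averaged over many routines it becomes a steadier measure of accuracy. The records below are invented, and the calculation is only a simplified stand-in for the platform's actual statistics.

```python
# Hypothetical sketch of long-term tracking: one routine tells you little,
# because any judge can miss a deduction, but the average absolute deviation
# from the panel consensus across many routines is a steadier accuracy signal.
# The records below are invented, not real competition data.

from statistics import mean

# (judge's mark, panel consensus mark) for routines judged over a season
records = [(8.4, 8.5), (7.9, 8.0), (8.7, 8.5), (9.1, 9.0), (8.2, 8.4)]

single_routine_error = abs(records[0][0] - records[0][1])   # noisy snapshot
long_run_error = mean(abs(j - p) for j, p in records)       # steadier signal

print(f"one routine: {single_routine_error:.2f}, long run: {long_run_error:.2f}")
```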

After Paris 2024, for example, Mercier hopes to look more closely at the impact of fatigue on judges. The first three days of competition at the World Championships in Belgium comprised lengthy qualification sessions of 10 to 12 hours per day.

Fluctuating arena temperatures may also have an impact, as in the rhythmic gymnastics example cited at the beginning of this article.

"The data that we had before was not rich enough for us to do the analysis, but this is something that we plan after the Paris Games. We would like to study if judges get fatigued and if we can observe it and measure it.

"I don't know if the best judges over long phases are good because they don't get fatigued for instance. Are there judges that are very good for 20 athletes and then their performance starts to decrease as the days get longer? We don't know that yet, but I would assume that it is the case that some judges will get fatigued as the competition goes along."

Under scrutiny

"We have data since the Rio Olympics and what we observed is that some apparatus are intrinsically more difficult to judge," says Mercier of one of the patterns that have emerged since the start of the project.

"It's not the judges, because the judges move from apparatus to apparatus, so I cannot say judges in pommel horse are bad and judges involved in vault are good. No, it's completely different.

"Judging a routine in pommel horse is intrinsically much more difficult than judging a routine in vault, and it's very important for the tools we provide to be fair. So, I always tell the judges that the mathematical tools that we use take into account that judging in pommel horse is difficult.

"So, on a 7.5 routine, which is a good routine on pommel horse, a judge who makes an error of, let's say 0.3, 0.4 is completely normal. In vault, giving 9.5 instead of 9.1, it's a very, very, very large error... it's an outrageous difference. So, this is all scaled, so that when we analyse the accuracy of judges, we take into account the apparatus on which they judge."

Other factors include performance quality. An outstanding routine, for example, has very few deductions and the Code of Points is easily applied, but a mediocre routine has more mistakes and many more deductions, which can be interpreted differently from judge to judge.

"We tell the judges, if we analyse your performance in the all-around final or in the team final tonight, on balance beam, you will be evaluated and compared to judges that judge in similar circumstances. So, we take that into account so that our assessment of judging accuracy is fair to the judges as well."

The results from the project feed into a ranking system of the most accurate and fair judges.

"None of the judges are paid so the honour, or the most prestigious assignment, for the judges is to judge during the summer Olympics," explains Mercier.

"In some countries, it's very, very important because from smaller countries, having a judge go to the Olympics can be an immense source of pride for the Federation," he continues.

"It can encourage the participation at the youth level and... especially in countries where sports for women, for instance, are not as developed as sports for men, having a woman qualify to be a judge at the Olympics, sends a strong signal and I say, 'hey, look what we can do here'. This is amazing."

So, it's not just a numbers game for Mercier; it's statistics with heart.

"Judges have never been so scrutinised because everybody at home can press pause and say, 'Okay, let's play back in slow motion'," smiles Mercier. "But overall, across the board, judging is excellent."

He knows this because the data tells him so.
