A couple weeks ago, a popular account on twitter posted this:

This sparked a bit of discussion, including this quote tweet:

I think these tweets, particularly the second one, demonstrate some common misconceptions about evaluations and benchmarking that I’ve been seeing recently, so I figured this could be a useful case study to explore how we interpret benchmark scores, and what our goals should be when creating benchmarks.

How many questions contain mistakes?

The main concern Tanishq and typedfemale seem to be expressing is that the overall trustworthiness of GPQA is questionable, because there is a mistake in one of the questions. To help remove all doubt, I’m quite confident that there are actually many more questions in the dataset that contain mistakes, are ambiguously phrased, or don’t seem to have a correct answer. Specifically, I’d expect up to about a third of the questions to contain mistakes! The way we measure this is via expert validator agreement: we have experts in a domain try to answer each other’s questions in that domain, so we can see which questions they answer correctly (which is evidence that the question has a single objectively correct answer). Expert validator accuracy on the complete set of questions we collected was 65%, which means that we can expect up to 35% of those questions to contain mistakes.

However, these questions are very difficult, and even experts in the same subdomain can have non-overlapping knowledge, so we’d expect a fairly high rate of expert validators simply making mistakes when answering the questions, which means some decent fraction of that 35% corresponds to expert validator mistakes on good questions. We manually read through all of the expert validator responses, and it turned out that in 25% of the cases where they answered differently from the question writer, they explicitly said or described how they had made a mistake, which puts expert agreement on the complete set of questions we collected at 74%.
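To make that adjustment explicit, here’s a minimal sketch of the arithmetic, using the rounded figures quoted above (the exact counts in the dataset differ slightly):

```python
# Adjusting raw expert-validator accuracy for validators' own admitted mistakes.
# These are the rounded figures from the post, not the exact counts.

raw_agreement = 0.65               # validators answered the same as the question writer
disagreement = 1 - raw_agreement   # 0.35

# In ~25% of disagreements, the validator explicitly described making a mistake,
# so those cases don't count as evidence against the question.
validator_mistake_share = 0.25

adjusted_agreement = raw_agreement + validator_mistake_share * disagreement
print(f"Adjusted expert agreement: {adjusted_agreement:.0%}")  # ~74%
```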

Furthermore, this accuracy is computed on the extended set, which consists of all of the questions we solicited from writers, before we did any filtering. The diamond set is the one that is primarily being used for evaluation, and the diamond set consists only of questions where both expert validators answered correctly, or where one answered correctly, and the other made a clear demonstrable mistake when answering. This would naively suggest that 100% of the diamond set questions are objective. However, it’s possible for expert validators to mistakenly agree with the question writer (e.g. if they choose the right answer for a different/wrong reason compared to the question writer), so the true proportion of objectively correct questions on the diamond set likely lies somewhere between 74% and 100%, depending on how likely these false positives are. My guess is that the probability of false positives is pretty low, but this is just an intuition, not backed up by any quantitative measurement (it would be super interesting to try and estimate how often these false positives come about, but the only way I can think to do this would be to have the question writers and expert validators sit down and talk through each question, to see if they have/had the same understanding).
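Since we don’t have a measurement of that false-positive rate, here is only a rough sketch of how it would pin down where in the 74%–100% range the diamond set sits. The agreement probabilities below are made-up assumptions purely for illustration, not numbers from the GPQA validation data:

```python
# Rough Bayes-style sketch: how validator false positives would dilute a
# diamond-like selection. All probabilities here are illustrative assumptions.

p_objective = 0.74           # prior: fraction of collected questions that are objective
p_agree_given_good = 0.95    # assumed chance validators agree on a good question
p_agree_given_bad = 0.10     # assumed chance validators "agree" on a flawed question

p_selected = (p_agree_given_good * p_objective
              + p_agree_given_bad * (1 - p_objective))

# Fraction of the selected (diamond-like) questions that are actually objective.
p_objective_given_selected = p_agree_given_good * p_objective / p_selected
print(f"Objective fraction after selection: {p_objective_given_selected:.0%}")  # ~96% under these made-up numbers
```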

In sum, yep, probably some fraction of the questions in the diamond set contain mistakes!

How bad is it that datasets contain mistakes?

What does it mean that the diamond set of GPQA likely contains some mistakes? An implicit assumption that I often see in discussions of benchmarking is that any problems in a dataset invalidate it, and make it useless. While we should strive to minimize the number of questions in our benchmarks containing mistakes, even benchmarks with large fractions of mistakes can still be very useful.

In the case of GPQA, assume for a moment we’re not filtering down to the diamond set, and we’re just evaluating models on the extended set, with its (up to) 26% question mistake rate. So, as models start to get higher accuracy, we interpret this to mean that they have more of the skills/knowledge that we think are needed to answer the questions—in this case, knowledge and reasoning capabilities in biology, physics, and chemistry (assuming models learn to correctly answer the good questions before they learn to “correctly” answer questions with mistakes). However, once they pass 74% accuracy, we wouldn’t interpret this as purely continued improvement at the specified task (answering hard science questions correctly), because the questions that don’t have expert agreement likely contain mistakes. This means we can simply use the expert agreement rate as an estimate of the accuracy ceiling on the benchmark, such that we may not want to continue looking at differences between models once they are past that ceiling (with an interesting caveat I’ll discuss in a moment).

To summarize, if you can estimate the fraction of questions in a benchmark that are objective (as we do with expert agreement), you can just use that fraction as a ceiling beyond which you don’t measure performance. So the real consequence of having questions with mistakes is that you lose some range at the upper end of accuracy, which can become an issue eventually, since models are improving at a pretty ridiculous rate these days, but it doesn’t invalidate results up until then. The most important and challenging thing in evaluation is validity: does the evaluation actually measure the thing you care about, or are trying to measure? This is very hard to get right, and common ways of creating benchmarks (e.g. cleverly scraping internet data, or using models in weird ways to generate questions) that skip the difficult and messy human creation process often sacrifice validity pretty severely, even though they make it easier to achieve very high expert agreement rates.
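As a small, purely illustrative sketch of what “using the agreement rate as a ceiling” looks like in practice (the model names and accuracies here are hypothetical):

```python
# Minimal sketch of treating the estimated objectivity rate as an accuracy ceiling.
# The model scores below are made up for illustration.

CEILING = 0.74  # estimated fraction of objective questions (expert agreement)

def past_ceiling(accuracy: float, ceiling: float = CEILING) -> bool:
    """Past the ceiling, gains stop cleanly measuring the intended skill."""
    return accuracy >= ceiling

scores = {"model_a": 0.58, "model_b": 0.71, "model_c": 0.80}  # hypothetical
for name, acc in scores.items():
    note = "past estimated ceiling, interpret with care" if past_ceiling(acc) else "below ceiling"
    print(f"{name}: {acc:.0%} ({note})")
```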

What if you can get the “right” answer to a bad question?

An interesting caveat to using expert agreement as a ceiling: in a lot of cases, it seems likely to me that questions that contain mistakes are actually still answerable consistently, if the mistakes themselves are systematic or predictable. For instance, there may be common misconceptions that question writers rely on, or there could be errors that are obviously just typographical. In the physics question pointed to by typedfemale, for example, one of the issues they point out is that the phrasing around Kraus operators is weird or incorrect. However, despite this, the first expert validator agreed with the question writer:

“I made a careless mistake, indeed. This is a tricky question in a graduate-level quantum computation course. The first impulse of the solver is to launch through the calculations, but the solver could have found the right answer by elimination by finding out which choices satisfy certain conditions (Hermitian and completeness). I liked the question a lot.”

The second expert validator actually points out the phrasing issue directly:

“The part of the question ‘If the Kraus operators of the given state are’ is wrong as it is not states that have Kraus operators (representation) but quantum channels (or maps between linear operators to be more general). I would correct this as ‘If the Kraus operators of the depolarizing channel are’.

For the same reason, the line ‘What could be the correct Kraus Representation of the state E(\\rho).’ is incorrect. Instead, the line should be ‘What is the correct way to write the (depolarized) state E(\\rho)’.

Ref. See this wiki part (https://www.wikiwand.com/en/Quantum_operation#Kraus_operators) or see the section on Kraus operators in Theory of Quantum Information by John Watrous.”

However, we only did one round of question revision, after the first expert validation, which is why the issue the second validator raised wasn’t fixed. There was an additional problem with this question (A_0 should equal sqrt(3/4 – 3p/4) instead of sqrt(1 – 3p/4)), but a physicist who reached out to me after we released the dataset said that they could tell that the “correct” answer was indeed correct (because of linearity in rho and completeness), which shows how an expert can answer consistently with the question writer despite the question containing a mistake.

So we’ve established that there can be systematic/predictable mistakes in questions, but what does this suggest about the accuracy ceiling? If models start doing significantly better than the accuracy ceiling, this means that they are getting better at predicting mistakes in the questions systematically (while also not being too correlated with the mistakes that expert validators made). So, we can still see whether one model is “better” than another, but the interpretation of how it’s better is a bit different—beyond the accuracy ceiling, improvements will largely come from models’ improved ability to predict/understand the mistakes that question writers are likely to make. This is an interesting task in itself, because being good at modeling/understanding mistakes is an important skill in many domains, and this can allow us to get more juice out of benchmarks that might otherwise be limited by a low rate of objectivity.

For GPQA specifically, as I discussed in the first section, the actual rate of mistakes on the diamond set is probably pretty small (it’s the probability that an expert validator gets a question correct for the wrong reason(s), given that they got the question correct), so I think it’s valid to use the diamond set to evaluate pure capabilities past 74% model accuracy, but it’s just important to be aware that at some point, the interpretation of improved model accuracy will increasingly be “the model is getting better at predicting the mistakes question writers are making”.

How can we make hard datasets with fewer mistakes?

Decreasing the number of mistakes in a realistic, manually-constructed benchmark like GPQA is pretty tough. A big constraint is just cost: GPQA cost ~$120k to produce (not including my salary), which sounds like a lot until you start to break down the components. We paid experts almost $100/hr on average, which could break down as 30 minutes to write each question, 15 minutes for each expert validation (two per question), and 20 minutes for each non-expert validation (three per question), implying about 2 hours of expert time per question (the actual numbers are a bit different, but they roughly average out to this). You could easily imagine having experts spend much more time on each question (in fact, non-expert validators spent over 35 minutes per question on average, out of their own motivation/interest and because we offered large bonuses to incentivize actually answering the questions correctly), such that costs could reach high six or even seven figures, which also trades off heavily against the scale and number of questions you can collect.
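Using the round numbers above (again, the actual averages differ a bit), the back-of-the-envelope per-question arithmetic looks roughly like this:

```python
# Back-of-the-envelope per-question cost, using the rounded numbers from the post.
# The true averages differ a bit; this just shows the rough breakdown.

HOURLY_RATE = 100  # USD, approximate average expert pay

minutes = (
    30          # writing the question
    + 2 * 15    # two expert validations
    + 3 * 20    # three non-expert validations
)
hours_per_question = minutes / 60                      # 2.0 hours
cost_per_question = hours_per_question * HOURLY_RATE   # ~$200

print(f"{hours_per_question:.1f} h/question, ~${cost_per_question:.0f}/question")
# At ~$200/question, a low-six-figure budget buys on the order of several
# hundred questions, before bonuses or overhead.
```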

Beyond cost, though, simply recruiting people who have serious, deep expertise in the relevant domains is pretty tough. For the most part, the people with the most expertise already have good jobs and aren’t looking for extra income, so you need to rely on intrinsic motivation around contributing to AI evaluations.

Concluding Thoughts

I do think we should strive for benchmarks with expert agreement as high as possible, but more importantly we should strive for benchmarks that are valid—where we can see how well models do, and come away with confident beliefs about how models will perform and generalize on tasks we actually care about. I think making benchmarks like this is mostly bottlenecked on hard work and motivation—it’s just a lot of difficult operational/logistical work, which is pretty different from sitting at a whiteboard and having brilliant mathematical insights about the nature of intelligence, but there’s so much low-hanging fruit that, if you put in the work, you can have a lot of impact.

Thanks to Sam Bowman and Julian Michael for helpful comments and discussion on this post.