Can Good Benchmarks Contain Mistakes?

A couple of weeks ago, a popular account on Twitter posted this:

This sparked a bit of discussion, including this quote tweet:

I think these tweets, particularly the second one, demonstrate some common misconceptions about evaluations and benchmarking that I’ve been seeing recently, so I figured this could be a useful case study to explore how we interpret benchmark scores, and what our goals should be when creating benchmarks.

How many questions contain mistakes?

The main concern Tanishq and typedfemale seem to be expressing is that the overall trustworthiness of GPQA is questionable, because there is a mistake in one of the questions. To help remove all doubt, I’m quite confident that there are actually many more questions in the dataset that contain mistakes, are ambiguously phrased, or don’t seem to have a correct answer. Specifically, I’d expect up to about a third of the questions to contain mistakes! The way we measure this is via expert validator agreement: we have experts in a domain try to answer each other’s questions in that domain, so we can see which questions they answer correctly (which is evidence that the question has a single objectively correct answer). Expert validator accuracy on the complete set of questions we collected was 65%, which means that we can expect up to 35% of those questions to contain mistakes.

However, these questions are very difficult, and even experts in the same subdomain can have non-overlapping knowledge, so we’d expect expert validators to make mistakes at a fairly high rate when answering the questions, which means some decent fraction of that 35% corresponds to expert validator mistakes on good questions. We manually read through all of the expert validator responses, and in 25% of the cases where a validator answered differently from the question writer, they explicitly said or described how they had made a mistake; crediting those cases back puts expert agreement on the complete set of questions we collected at 74%.
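
To make the arithmetic explicit, here is a minimal sketch of that adjustment in Python (the 65% and 25% figures are the rounded ones quoted above, so the output is approximate):

    # Raw expert validator accuracy on the extended set of questions.
    raw_agreement = 0.65

    # Fraction of disagreements where the validator explicitly described
    # making a mistake themselves.
    validator_error_share = 0.25

    # Credit those cases back to the question writers.
    adjusted_agreement = raw_agreement + validator_error_share * (1 - raw_agreement)

    print(f"Adjusted expert agreement: {adjusted_agreement:.0%}")        # ~74%
    print(f"Implied mistake-rate ceiling: {1 - adjusted_agreement:.0%}")  # ~26%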

Furthermore, this accuracy is computed on the extended set, which consists of all of the questions we solicited from writers, before we did any filtering. The diamond set, which is the one primarily being used for evaluation, consists only of questions where both expert validators answered correctly, or where one answered correctly and the other made a clear, demonstrable mistake when answering. This would naively suggest that 100% of the diamond set questions are objective. However, it’s possible for expert validators to mistakenly agree with the question writer (e.g. if they choose the right answer for a different/wrong reason than the question writer), so the true proportion of objectively correct questions in the diamond set likely lies somewhere between 74% and 100%, depending on how likely these false positives are. My guess is that the probability of false positives is pretty low, but this is just an intuition, not backed up by any quantitative measurement (it would be super interesting to try to estimate how often these false positives come about, but the only way I can think to do this would be to have the question writers and expert validators sit down and talk through each question, to see if they have/had the same understanding).

In sum, yep, probably some fraction of the questions in the diamond set contain mistakes!

How bad is it that datasets contain mistakes?

What does it mean that the diamond set of GPQA likely contains some mistakes? An implicit assumption that I often see in discussions of benchmarking is that any problems in a dataset invalidate it, and make it useless. While we should strive to minimize the number of questions in our benchmarks containing mistakes, even benchmarks with large fractions of mistakes can still be very useful.

In the case of GPQA, assume for a moment we’re not filtering down to the diamond set, and we’re just evaluating models on the extended set, with its (up to) 26% question mistake rate. So, as models start to get higher accuracy, we interpret this to mean that they have more of the skills/knowledge that we think are needed to answer the questions—in this case, knowledge and reasoning capabilities in biology, physics, and chemistry (assuming models learn to correctly answer the good questions before they learn to “correctly” answer questions with mistakes). However, once they pass 74% accuracy, we wouldn’t interpret this as purely continued improvement at the specified task (answering hard science questions correctly), because the questions that don’t have expert agreement likely contain mistakes. This means we can simply use the expert agreement rate as an estimate of the accuracy ceiling on the benchmark, such that we may not want to continue looking at differences between models once they are past that ceiling (with an interesting caveat I’ll discuss in a moment).

To summarize, if you can estimate the fraction of questions in a benchmark that are objective (as we do with expert agreement), you can just use that fraction as a ceiling that you don’t measure performance beyond. So the real consequence of having questions with mistakes is that you lose some range at the upper end of accuracy, which can become an issue eventually, since models are improving at a pretty ridiculous rate these days, but it doesn’t invalidate results up until then. The most important and challenging thing in evaluation is validity: does the evaluation actually measure the thing you care about, or are trying to measure? This is very hard to get right, and common ways of creating benchmarks (e.g. trying to cleverly scrape internet data, or using models in weird ways to generate the questions) that don’t involve the difficult and messy human creation process often sacrifice validity pretty severely, despite making it easier to achieve very high expert agreement rates.
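
As a toy illustration of how you might apply such a ceiling when comparing models (the model names and scores here are made up; 0.74 is the expert agreement rate from above):

    # Estimated fraction of objective questions: the expert agreement rate.
    accuracy_ceiling = 0.74

    # Hypothetical accuracies for two models on the extended set.
    model_scores = {"model_a": 0.71, "model_b": 0.78}

    for name, score in model_scores.items():
        status = "below ceiling: interpretable as skill at the intended task"
        if score >= accuracy_ceiling:
            status = "past ceiling: further gains partly reflect predicting writer mistakes"
        print(f"{name}: {score:.0%} ({status})")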

What if you can get the “right” answer to a bad question?

An interesting caveat to using expert agreement as a ceiling: in a lot of cases, it seems likely to me that questions that contain mistakes are actually still answerable consistently, if the mistakes themselves are systematic or predictable. For instance, there may be common misconceptions that question writers rely on, or there could be errors that are obviously just typographical. In the physics question pointed to by typedfemale, for example, one of the issues they point out is that the phrasing around Kraus operators is weird or incorrect. However, despite this, the first expert validator agreed with the question writer:

“I made a careless mistake, indeed. This is a tricky question in a graduate-level quantum computation course. The first impulse of the solver is to launch through the calculations, but the solver could have found the right answer by elimination by finding out which choices satisfy certain conditions (Hermitian and completeness). I liked the question a lot.”

The second expert validation of the question actually points the phrasing issue out directly:

“The part of the question ‘If the Kraus operators of the given state are’ is wrong as it is not states that have Kraus operators (representation) but quantum channels (or maps between linear operators to be more general). I would correct this as ‘If the Kraus operators of the depolarizing channel are’.

For the same reason, the line ‘What could be the correct Kraus Representation of the state E(\rho).’ is incorrect. Instead, the line should be ‘What is the correct way to write the (depolarized) state E(\rho)’.

Ref. See this wiki part (https://www.wikiwand.com/en/Quantum_operation#Kraus_operators) or see the section on Kraus operators in Theory of Quantum Information by John Watrous.”

However, we only did one round of question revision after the first expert validator, which is why the issue wasn’t fixed. There was an additional problem with this question (A_0 should equal sqrt(3/4 – 3p/4), instead of sqrt(1-3p/4)), but a physicist who reached out to me after we released the dataset said that they could tell that the “correct” answer was indeed correct (because of linearity in rho and completeness), which shows how an expert can answer consistently with the question writer, despite the question containing a mistake.
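
For reference, the two conditions being appealed to here (by the first validator and by the physicist) are the standard defining properties of a Kraus representation of a channel E acting on a state \rho; this is textbook material rather than anything specific to this question:

    E(\rho) = \sum_k A_k \rho A_k^\dagger, \qquad \sum_k A_k^\dagger A_k = I

Because E(\rho) must be linear in \rho and the operators must satisfy the completeness condition on the right, several of the answer options can be ruled out without redoing the full calculation, which is presumably how the validator and the physicist converged on the intended answer despite the typo in A_0.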

So we’ve established that there can be systematic/predictable mistakes in questions, but what does this suggest about the accuracy ceiling? If models start doing significantly better than the accuracy ceiling, this means that they are getting better at predicting mistakes in the questions systematically (while also not being too correlated with the mistakes that expert validators made). So, we can still see whether one model is “better” than another, but the interpretation of how it’s better is a bit different—beyond the accuracy ceiling, improvements will largely come from models’ improved ability to predict/understand the mistakes that question writers are likely to make. This is an interesting task in itself, because being good at modeling/understanding mistakes is an important skill in many domains, and this can allow us to get more juice out of benchmarks that might otherwise be limited by a low rate of objectivity.

For GPQA specifically, as I discussed in the first section, the actual rate of mistakes on the diamond set is probably pretty small (it’s the probability that an expert validator gets a question correct for the wrong reason(s), given that they got the question correct), so I think it’s valid to use the diamond set to evaluate pure capabilities past 74% model accuracy, but it’s just important to be aware that at some point, the interpretation of improved model accuracy will increasingly be “the model is getting better at predicting the mistakes question writers are making”.

How to make hard datasets with fewer mistakes?

Decreasing the number of mistakes in a realistic, manually-constructed benchmark like GPQA is pretty tough. A big constraint is just cost: GPQA cost ~$120k to produce (not including my salary), which sounds like a lot until you start to break down the components. We paid experts almost $100/hr on average, which could break down as 30 minutes to write each question, 15 minutes for each expert validation (two per question), and 20 minutes for each non-expert validation (three per question), implying 2 hours of expert time per question (the actual numbers are a bit different, but they roughly average out to this). You could easily imagine having experts spend much more time on each question (in fact, non-expert validators spent over 35 minutes on average per question, out of their own motivation/interest and because we had large bonuses to incentivize actually answering the questions correctly), such that you can reach high six or even seven figure costs, which also trades off heavily with the scale and number of questions you can collect.
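
To make the back-of-the-envelope math concrete, here is a rough sketch; the rate and times are the approximations given above, and the question count is my own round number for the extended set, so the total only roughly matches the ~$120k figure:

    hourly_rate = 100               # approximate average expert pay, in $/hr
    writing_min = 30                # question writing, per question
    expert_validation_min = 15      # per expert validation (2 per question)
    nonexpert_validation_min = 20   # per non-expert validation (3 per question)

    minutes_per_question = writing_min + 2 * expert_validation_min + 3 * nonexpert_validation_min
    hours_per_question = minutes_per_question / 60          # 2.0 hours
    cost_per_question = hours_per_question * hourly_rate    # ~$200

    num_questions = 550             # assumption: rough size of the extended set
    total_cost = cost_per_question * num_questions
    print(f"{hours_per_question:.1f} expert-hours, ~${cost_per_question:.0f} per question")
    print(f"~${total_cost / 1000:.0f}k total for {num_questions} questions")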

Beyond cost though, actually just recruiting people who have serious deep expertise in relevant domains is pretty tough. For the most part, the people with the most expertise already have good jobs and aren’t looking for extra income, so you need to rely on intrinsic motivation around contributing to AI evaluations.

Concluding Thoughts

I do think we should strive for benchmarks with expert agreement rates as high as possible, but more importantly we should strive for benchmarks that are valid—where we can see how well models do, and come away with confident beliefs about how models will perform and generalize on tasks we actually care about. I think making benchmarks like this is mostly bottlenecked on hard work and motivation—it’s just a lot of difficult operational/logistical work, which is pretty different from sitting at a whiteboard and having brilliant mathematical insights about the nature of intelligence, but there’s so much low-hanging fruit that, if you put in the work, you can have a lot of impact.

Thanks to Sam Bowman and Julian Michael for helpful comments and discussion on this post.

Eight Things to Know about Large Language Models

I’m sharing a draft of a slightly-opinionated survey paper I’ve been working on for the last couple of months: Eight Things to Know about Large Language Models. Here are the eight things:

  1. LLMs predictably get more capable with increasing investment, even without targeted innovation.
  2. Many important LLM behaviors emerge unpredictably as a byproduct of increasing investment.
  3. LLMs often appear to learn and use representations of the outside world.
  4. There are no reliable techniques for steering the behavior of LLMs.
  5. Experts are not yet able to interpret the inner workings of LLMs.
  6. Human performance on a task isn’t an upper bound on LLM performance.
  7. LLMs need not express the values of their creators nor the values encoded in web text.
  8. Brief interactions with LLM chatbots are often misleading.

An enormous number of people—including journalists, advocates, lawmakers, and academics—have started to pay attention to this technology in the last few months. This is appropriate: The technology is on track to be really impactful, and we want the full force of government and civil society to be involved in figuring out what we do with it. I’m aiming for this paper to cover points that are relevant to some of these decisions, but that might be easy to miss for someone just starting to follow the technology. I also considered calling it “Eight Ways that Large Language Models are a Weird Technology”.

It’s a survey: All of the evidence I use was published by others, and most of the arguments have already been stated clearly by others. (When in doubt, cite them, not me.)

All of these claims should seem obvious to at least some large subset of the researchers who build and test these models, and there’s good evidence for all of them, though some of them are still controversial—I try to point out where that’s the case.

I also close with some less survey-ish discussion that riffs on the above. Teasers:

  • We should expect some of the prominent flaws of current LLMs to improve significantly.
  • There will be incentives to deploy LLMs as agents that flexibly pursue goals.
  • LLM developers have limited influence over what is developed.
  • LLMs are likely to produce a rapidly growing array of risks.
  • Negative results with LLMs can be difficult to interpret but point to areas of real weakness.
  • The science and scholarship around LLMs is especially immature.

Why I Think More NLP Researchers Should Engage with AI Safety Concerns

Large language modeling research in NLP seems to be feeding into much more impactful technologies than we’re used to working with. While the positive potential for this technology could be tremendous, the downside risk is also potentially catastrophic, and it doesn’t look like we’re prepared to manage that risk. 

I’m starting a new research group at NYU to work on technical directions that I think are relevant to these concerns, and I’d encourage others in the field to look for ways to take these concerns into account as well.

Why I’m concerned

As a research community, we’re making progress quickly but chaotically.

  • Progress on language technology, by many important measures, has accelerated dramatically recently with the success of very-large-scale self-supervised training of simple neural networks. Most benchmarks that were considered credible in the field in 2018 or 2019 are now solved at human parity. While many of these benchmarks had identifiable weaknesses, this wasn’t the case across the board, and the overall rate of progress has been pretty shocking. (Both to me and to professional forecasters.) Whether or not our best systems are conscious or human-like in any important sense, they’re now creative and persuasive enough that they’ll occasionally convince people that they are.
    • Spending any time interacting with the best available LLMs makes it clear that, whether or not they understand in some philosophically salient sense, they have rich enough syntax, semantics, pragmatics, and world modeling to be able to do many of the things that humans use language for, and they’re continuing to improve.
  • We’re increasingly making progress on hard problems by accident. GPT-3 was a good substrate for few-shot learning and chain-of-thought/scratchpad-style reasoning, despite not having been designed for either. The progress we’re making is not the result of careful plans by research teams.
  • The public discourse about large language models, especially in academic settings, often badly understates the pace of progress.
    • It’s still easy to find extremely strong negative claims about these models from serious researchers, such as that LLMs don’t have the capacity to handle negation (i.e., that they won’t be appropriately sensitive to it even in typical cases that look like language model training data). Ten minutes of interaction with the OpenAI API sandbox should be sufficient to refute this, but the academic conversation often anchors primarily on older systems or on smaller systems that are practical to run on academic clusters.
  • There are clear incentives for this kind of progress to continue, even if no one institution or community has a clear picture of where we’re going. Deep Ganguli and Jack Clark (et al.) argue that scaling-law results imply to industry actors that building larger models will unlock some capabilities that are valuable enough to pay back the cost of training, even if we don’t know in advance what those capabilities are or what risks come with them. These trends are enough that even significant regulation in the US or EU is unlikely to set back progress globally for more than a couple of years.

The field could nonetheless develop systems that are as good as we are at many important cognitive tasks.

  • It seems plausible that another decade or so of progress at this rate—through both novel scientific developments and continued increases in scale—could get us to human-like behavior on most aspects of language use, reasoning, and planning.
  • If this kind of human-level behavior is possible at all with deep learning, we’ll almost certainly see it within the next few decades. We’re already seeing experiments that come within a couple orders of magnitude of using the amount of computation done by a human brain over an entire lifetime.
    • Whether the resulting systems would be human-like in any deep sense is not what’s at issue here. Whether we achieve human parity across all domains (especially including locomotion/robotics) is also not at issue here. Just achieving human-like behavior on some key aspects of language use, reasoning, and planning would be enough to be extremely consequential.

If we build AI systems with human-level cognitive abilities, things could get very weird.

  • Pretty much every bad outcome we’re seeing from present-day NLP (and related technologies) could get a lot bigger and a lot worse.
    • In particular, this is enough to get fine-grained surveillance and personalized persuasion to really work: Human-like cognitive abilities—plus cheap compute—would make it possible to deploy the equivalent of one censor or political strategist or intelligence service agent for every citizen in a country. It’s easy to imagine ways that even a clumsy implementation of something like this could lead to the rise of new stable totalitarian states.
  • At this point, I expect to start seeing people set systems up to act agentically and pursue relatively long-range goals. This doesn’t necessarily require that systems be trained from scratch in a way that gives them coherent goals or long-range planning out of the box, just that they’re separately able to (i) write plans, (ii) execute the steps of those plans, and (iii) update high-level plans in response to new low-level evidence. (I include a minimal sketch of such a loop just after this list.)
    • For a simple example, a competent language-and-code model that can act agentically would be able to design and build an entire large module or app, based only on a specification. This alone would be valuable enough to encourage people to experiment with this kind of deployment.
    • If this works, we should see AI systems taking on many kinds of human-like professional work at large scales, at least in more loosely-regulated industries. This would concentrate power and wealth in the hands of the system owners to an even more destabilizing degree than we’ve seen from technology so far.
  • In the likely event that these capabilities emerge in a system that we don’t understand—like a modern deep learning model—that opens up an even wider range of bad outcomes. 
    • At this point, it’s easy to accidentally wind up in positions where an AI system is pursuing goals in ways that its owners or creators can’t meaningfully oversee or take responsibility for. Unless its goals accord very closely with human norms and values, this is likely to lead to power-seeking and deception. Why?
      • Most simple objective functions or goals become dangerous in the limit, usually because of secondary subgoals that emerge along the way:
        • Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed.
        • Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed.
        • Toy example: If you really want to optimize the profitability of a product that you’re marketing, it helps to coerce customers into buying it and to coerce or overpower regulators and company managers so they don’t stop you.
    • These really bad behaviors—power-seeking and deception—tend to require that a system has a decently clear picture of what kind of situation it’s in, who its operators are, and how they think. This isn’t something that we’re seeing in current models. While that fact is potentially reassuring, it also makes these risks harder to study and harder to mitigate through business-as-usual empirical ML research.
      • As an example: If you fine-tune a language model with the widely-used RLHF technique to be truthful and helpful, and you apparently succeed, the actual generalization you’ll observe at deployment varies depending on how much situational awareness the model has. In current regimes, you should expect something like “try to tell the truth”. For a sufficiently aware model, you should expect something like “try to say things your developers believe, whether or not they’re true”.
      • For this reason (and others), the problem of aligning systems such that they don’t show power-seeking or deceptive failure modes is difficult, and is likely to have weaker feedback loops than most other work in ML. Simple attempts to work around this problem have mostly been found to have serious flaws.
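
To make the plan/execute/update pattern mentioned earlier in this list concrete, here is a minimal sketch in Python; every name here is a hypothetical placeholder rather than a real API, and `model` stands in for any text-in, text-out LLM wrapper:

    def execute(step: str) -> str:
        """Hypothetical placeholder: a real system would run code, call tools, etc."""
        return f"(pretend result of: {step})"

    def goal_achieved(goal: str, result: str) -> bool:
        """Hypothetical placeholder for a success check."""
        return False

    def run_agent(goal: str, model, max_steps: int = 10) -> str:
        """Toy agent loop; `model` is any callable mapping a prompt string to text."""
        # (i) write a plan
        plan = model(f"Write a step-by-step plan to achieve: {goal}")
        result = ""
        for _ in range(max_steps):
            # (ii) execute the next step of the plan
            step = model(f"Given the plan:\n{plan}\nWhat is the next concrete action?")
            result = execute(step)
            if goal_achieved(goal, result):
                break
            # (iii) update the high-level plan in response to new low-level evidence
            plan = model(f"Plan:\n{plan}\nLast action: {step}\nResult: {result}\nRevise the plan.")
        return result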

The only hard requirements for these risks are (i) that systems be competent enough and (ii) that systems be too opaque for us to be able to audit or supervise their reasoning processes reliably. The former seems like a plausible continuation of current scaling trends, and the latter seems like the default state of affairs if we continue to use end-to-end optimization and deep learning.

None of this means that AI progress has to be bad. 

This line of work could produce some of the most valuable and broadly-beneficial tools we’ll ever invent. But we only get to realize that value if we’re thoughtful and careful about the direct risks of the technology and cognizant of the challenges of governing a world that contains such technology.

Wait, are you really just talking about Effective Altruism or Longtermism or Those Weird Rationalists or Peter Thiel or Crypto or Giving Legal Rights to AIs or Colonizing Mars?

As for Effective Altruism: a little, but not really. I’ve been in touch with people in the Effective Altruism intellectual orbit pretty often lately, initially because Giving What We Can inspired me to donate a bunch to effective (but non-AI-related) charities. That community tends to talk about the long-term trajectory of technological progress more than most, and that got some of these ideas on my radar, but they have no special claim on them.

As for the rest, no. There’s a confusing mess of communities that have some claim on the history of AI safety as a field, but the basic premise is worth taking seriously regardless of that history.

Where does this leave us?

The details here are debatable, and I don’t claim to have a completely clear picture of these concerns. That said, I think the basic conclusions that are motivating me here are pretty intuitive, and that you can reach them from a bunch of angles. Stuart Russell uses the analogy that future AI systems could displace humanity as the most practically competent agents on earth. Occupying that second-place position hasn’t gone very well for the chimpanzees. 

It is straightforwardly dangerous to build an intelligent system that is perceptive and strategic, has superhuman memory, has the ability to copy itself, and has the ability to work much faster than us… at least unless we’re very sure we understand how the system works and what (if anything) it’s trying to do.

What can one do about these concerns?

If you take the above seriously, it seems like we basically have two choices:

  1. Stop all AI/ML research, anywhere in the world, that could get us close to powerful AI. This would require an unprecedented degree of global coordination and surveillance, and it’s not clear that it would be desirable, since it would snuff out much of the positive potential of AI.
    or
  2. Make sure all sufficiently powerful AI systems are built and deployed responsibly. This requires doing three very hard things:
    1. build robust techniques to align powerful AI systems with the values and goals of their operators, so we don’t get accidental catastrophes,
    2. ensure that those techniques are understood and used by any group that could plausibly build sufficiently powerful AI, and 
    3. ensure that we’re able to govern the operators of powerful AI systems in a way that makes their actions broadly compatible with democracy and positive for humanity as a whole.

I prefer option 2. This is what the AI safety community is trying to do, but it’s a hard problem, and the couple of hundred researchers who are currently doing this work—with about 0.1% as much funding as is being poured into accelerating AI progress—probably aren’t yet up to the challenge. There’s a lot of room to help out.

For people like me whose skills and experience mostly involve technical research in NLP, I think 2a—alignment research—is the obvious place to look, since it has loads of potentially-important open problems that look like NLP research.

If you’re interested and looking for a job: Of the places doing empirical work on language, I especially recommend recent work from Anthropic (where I’ve been spending time on sabbatical 👋), OpenAI, DeepMind, Redwood Research, and Jacob Steinhardt’s lab at Berkeley (examples in links). (Of course, some of these labs are also plausibly doing some of the most potentially dangerous work in this area, but to the extent they’re hiring people to work on safety, it seems worth making sure they succeed!) The Effective-Altruism-oriented career advice service 80,000 Hours has a good introduction to what it’s like to work in this area.

If you’re interested in getting involved and you could use funding: Like many other groups doing this kind of work, some of our funding comes from Open Philanthropy. They and a few other funders are trying to help grow this field and they’re often willing to fund big projects on these topics. If you’re interested in doing similar work elsewhere, I’d encourage you to look for funding there. I’m happy to try to give advice on this if it’s helpful.

The new lab

I’m setting up a new lab! It’s called the NYU Alignment Research Group. University labs are mostly just labels for clusters of people within existing departments rather than institutions with any weight, but this is meant to be a signal that I—and the ~dozen researchers who signed on to be part of it from the start—are committed to choosing research directions that try to address these risks.

What we’re doing

We’ll be doing empirical work with large language models for the foreseeable future. Beyond that, I expect our plans to evolve over time, but for the moment, we’re working on a slate of exploratory projects on topics like:

  • Concrete alignment strategies inspired by debate, amplification, and recursive reward modeling that attempt to use an AI system’s capabilities—even if they’re initially unreliable or unaligned—to allow us to bootstrap human oversight on difficult problems: These seem to me to be some of the best-vetted strategies for alignment, and while they aren’t a complete solution, they could plausibly be a large part of one if they worked. Despite this, little empirical work has been done so far to test their feasibility.
  • Sandwiching-style experimental protocols, where, roughly, we look for ways to pose artificial alignment challenges in which researchers need to reliably solve artificial tasks using unreliable AI/ML tools that have some knowledge or skill that’s necessary for the task.
  • Alignment-relevant properties of generalization in large language models: For example, when will a model that is aggressively fine-tuned to be truthful and calibrated on a simple class of questions also be truthful and calibrated on more difficult questions?
  • Chain-of-thought-style reasoning in large language models: How far can a plain language model’s stated reasoning diverge from its actual behavior on tasks where it reliably succeeds? Are there fine-tuning strategies that meaningfully constrain this divergence?
  • Looking for additional ways to better understand (and communicate) how hard this problem is likely to be.
    • Potentially including benchmarking and competition-building work, like the inverse scaling prize that several of us helped with.

Initial participants

Here’s the current group. Of course, not everyone here will endorse everything I said above. I’m excited to work with them!

Collaborating PI:

Outside advisor: 

Research scientists:

PhD students: 

If you want to join us, we’ll probably hire another couple of PhD students in the upcoming application cycle, and we’ll probably put out another broad open call for non-student roles (junior researchers, research engineers, and postdocs) at least once next year.