Large language modeling research in NLP seems to be feeding into much more impactful technologies than we’re used to working with. While the positive potential for this technology could be tremendous, the downside risk is also potentially catastrophic, and it doesn’t look like we’re prepared to manage that risk. 

I’m starting a new research group at NYU to work on technical directions that I think are relevant to these concerns, and I’d encourage others in the field to look for ways to take these concerns into account as well.

Why I’m concerned

As a research community, we’re making progress quickly but chaotically.

  • Progress on language technology, by many important measures, has accelerated dramatically recently with the success of very-large-scale self-supervised training of simple neural networks. Most benchmarks that were considered credible in the field in 2018 or 2019 are now solved at human parity. While many of these benchmarks had identifiable weaknesses, this wasn’t the case across the board, and the overall rate of progress has been pretty shocking. (Both to me and to professional forecasters.) Whether or not our best systems are conscious or human-like in any important sense, they’re now creative and persuasive enough that they’ll occasionally convince people that they are.
    • Spending any time interacting with the best available LLMs makes it clear that, whether or not they understand in some philosophically salient sense, they have rich enough syntax, semantics, pragmatics, and world modeling to be able to do many of the things that humans use language for, and they’re continuing to improve.
  • We’re increasingly making progress on hard problems by accident. GPT-3 was a good substrate for few-shot learning and chain-of-thought/scratchpad-style reasoning, despite not having been designed for either. The progress we’re making is not the result of careful plans by research teams.
  • The public discourse about large language models, especially in academic settings, often badly understates the pace of progress.
    • It’s still easy to find extremely strong negative claims about these models from serious researchers, such as that LLMs don’t have the capacity to handle negation (i.e., that they won’t be appropriately sensitive to it even in typical cases that look like language model training data). Ten minutes of interaction with the OpenAI API sandbox should be sufficient to refute this, but the academic conversation often anchors primarily on older systems or on smaller systems that are practical to run on academic clusters.
  • There are clear incentives for this kind of progress to continue, even if no one institution or community has a clear picture of where we’re going. Deep Ganguli and Jack Clark (et al.) argue that scaling-law results imply to industry actors that building larger models will unlock some capabilities that are valuable enough to pay back the cost of training, even if we don’t know in advance what those capabilities are or what risks come with them. These trends are enough that even significant regulation in the US or EU is unlikely to set back progress globally for more than a couple of years.
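
For readers who haven’t followed the scaling-laws literature, the core empirical finding is that language-model test loss falls off as a smooth power law as scale increases. The snippet below is purely illustrative (my gloss, not anything from the work cited above), using the approximate parameter-count fit reported by Kaplan et al. (2020); the exact constants matter much less than the smoothness and predictability of the trend.

```python
# Illustrative only: the approximate parameter-count scaling fit from
# Kaplan et al. (2020), L(N) = (N_c / N) ** alpha_N, where N is the number
# of non-embedding parameters. Constants are rough values from that paper.

N_C = 8.8e13      # fitted constant, in parameters
ALPHA_N = 0.076   # fitted exponent

def predicted_loss(n_params: float) -> float:
    """Predicted test loss (nats/token) for a model with n_params non-embedding parameters."""
    return (N_C / n_params) ** ALPHA_N

for n in [1e8, 1e9, 1e10, 1e11, 1e12]:
    print(f"{n:.0e} parameters -> predicted loss ~ {predicted_loss(n):.2f}")
```

It’s the reliability of fits like this, rather than any specific capability forecast, that gives labs confidence that the next, larger training run will pay off.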

The field could nonetheless develop systems that are as good as we are at many important cognitive tasks.

  • It seems plausible that another decade or so of progress at this rate—through both novel scientific developments and continued increases in scale—could get us to human-like behavior on most aspects of language use, reasoning, and planning.
  • If this kind of human-level behavior is possible at all with deep learning, we’ll almost certainly see it within the next few decades. We’re already seeing experiments that come within a couple of orders of magnitude of using the amount of computation done by a human brain over an entire lifetime. (A rough back-of-the-envelope version of this comparison appears after this list.)
    • Whether the resulting systems would be human-like in any deep sense is not what’s at issue here. Whether we achieve human parity across all domains (especially including locomotion/robotics) is also not at issue here. Just achieving human-like behavior on some key aspects of language use, reasoning, and planning would be enough to be extremely consequential.
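
To put rough numbers on the comparison above: the following back-of-the-envelope sketch uses a commonly cited (and highly uncertain) estimate of brain compute from Joseph Carlsmith’s 2020 report, together with GPT-3’s reported training compute as the reference experiment. All of the figures are loose assumptions on my part, not claims from the sources discussed in this post.

```python
import math

# Loose, commonly cited estimates; every number here is uncertain.
BRAIN_FLOP_PER_SECOND = 1e15             # mid-range estimate from Carlsmith (2020); plausible range ~1e13-1e17
LIFETIME_SECONDS = 80 * 365 * 24 * 3600  # ~2.5e9 seconds in an 80-year lifetime
GPT3_TRAINING_FLOP = 3.14e23             # reported GPT-3 training compute (~3,640 petaflop/s-days)

brain_lifetime_flop = BRAIN_FLOP_PER_SECOND * LIFETIME_SECONDS
gap = math.log10(brain_lifetime_flop / GPT3_TRAINING_FLOP)

print(f"Human brain, whole lifetime: ~{brain_lifetime_flop:.1e} FLOP")
print(f"GPT-3 training run:          ~{GPT3_TRAINING_FLOP:.1e} FLOP")
print(f"Gap: ~{gap:.1f} orders of magnitude")
```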

If we build AI systems with human-level cognitive abilities, things could get very weird.

  • Pretty much every bad outcome we’re seeing from present-day NLP (and related technologies) could get a lot bigger and a lot worse.
    • In particular, this is enough to get fine-grained surveillance and personalized persuasion to really work: Human-like cognitive abilities—plus cheap compute—would make it possible to deploy the equivalent of one censor or political strategist or intelligence service agent for every citizen in a country. It’s easy to imagine ways that even a clumsy implementation of something like this could lead to the rise of new stable totalitarian states.
  • At this point, I expect to start seeing people set systems up to act agentically and pursue relatively long-range goals. This doesn’t necessarily require that systems be trained from scratch in a way that gives them coherent goals or long-range planning out of the box, just that they’re separately able to (i) write plans, (ii) execute the steps of those plans, and (iii) update high-level plans in response to new low-level evidence. (A minimal sketch of what such a loop could look like appears after this list.)
    • For a simple example, a competent language-and-code model that can act agentically would be able to design and build an entire large module or app, based only on a specification. This alone would be valuable enough to encourage people to experiment with this kind of deployment.
    • If this works, we should see AI systems taking on many kinds of human-like professional work at large scales, at least in more loosely-regulated industries. This would concentrate power and wealth in the hands of the system owners to an even more destabilizing degree than we’ve seen from technology so far.
  • In the likely event that these capabilities emerge in a system that we don’t understand—like a modern deep learning model—that opens up an even wider range of bad outcomes. 
    • At this point, it’s easy to accidentally wind up in positions where an AI system is pursuing goals in ways that its owners or creators can’t meaningfully oversee or take responsibility for. Unless its goals accord very closely with human norms and values, this is likely to lead to power-seeking and deception. Why?
      • Most simple objective functions or goals become dangerous in the limit, usually because of secondary subgoals that emerge along the way:
        • Pursuing typical goals arbitrarily well requires a system to prevent itself from being turned off, by deception or force if needed.
        • Pursuing typical goals arbitrarily well requires acquiring any power or resources that could increase the chances of success, by deception or force if needed.
        • Toy example: If you really want to optimize the profitability of a product that you’re marketing, it helps to coerce customers into buying it and to coerce or overpower regulators and company managers so they don’t stop you.
    • These really bad behaviors—power-seeking and deception—tend to require that a system has a decently clear picture of what kind of situation it’s in, who its operators are, and how they think. This isn’t something that we’re seeing in current models. While that fact is potentially reassuring, it also makes these risks harder to study and harder to mitigate through business-as-usual empirical ML research.
      • As an example: If you fine-tune a language model with the widely-used RLHF technique to be truthful and helpful, and you apparently succeed, the actual generalization you’ll observe at deployment varies depending on how much situational awareness the model has. In current regimes, you should expect something like “try to tell the truth”. For a sufficiently aware model, you should expect something like “try to say things your developers believe, whether or not they’re true”.
      • For this reason (and others), the problem of aligning systems such that they don’t show power-seeking or deceptive failure modes is difficult, and is likely to have weaker feedback loops than most other work in ML. Simple attempts to work around this problem have mostly been found to have serious flaws.
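
To make the plan/execute/update pattern from the agentic-systems bullet above concrete, here is a minimal, hypothetical sketch. Nothing in it corresponds to a real system or API: call_model and run_step are placeholders for a capable language model and for some way of acting in the world (running code, calling tools, sending messages).

```python
# Hypothetical sketch of the (i) plan / (ii) execute / (iii) update loop
# described above. call_model and run_step are placeholders, not real APIs.

def call_model(prompt: str) -> str:
    """Stand-in for a query to a capable language model."""
    raise NotImplementedError

def run_step(step: str) -> str:
    """Stand-in for carrying out one step (running code, calling a tool, etc.)."""
    raise NotImplementedError

def pursue_goal(goal: str, max_steps: int = 50) -> str:
    # (i) Write an initial plan.
    plan = call_model(f"Write a numbered, step-by-step plan to accomplish: {goal}")
    for _ in range(max_steps):
        # (ii) Execute the next unfinished step of the plan.
        step = call_model(f"Plan:\n{plan}\n\nState the single next unfinished step.")
        observation = run_step(step)
        # (iii) Update the high-level plan in light of the new low-level evidence.
        plan = call_model(
            f"Goal: {goal}\nPlan:\n{plan}\nJust attempted: {step}\nResult: {observation}\n"
            "Rewrite the plan, marking completed steps. Write DONE if the goal is met."
        )
        if "DONE" in plan:
            break
    return plan
```

None of the three pieces requires a model that was trained to pursue goals; the scaffolding supplies the long-range structure, which is part of why deployments like this seem likely to be tried.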

The only hard requirements for these risks to arise are (i) that systems be competent enough and (ii) that they be too opaque for us to audit or supervise their reasoning processes reliably. The former seems like a plausible continuation of current scaling trends, and the latter seems like the default state of affairs if we continue to use end-to-end optimization and deep learning.

None of this means that AI progress has to be bad. 

This line of work could produce some of the most valuable and broadly-beneficial tools we’ll ever invent. But we only get to realize that value if we’re thoughtful and careful about the direct risks of the technology and cognizant of the challenges of governing a world that contains such technology.

Wait, are you really just talking about Effective Altruism or Longtermism or Those Weird Rationalists or Peter Thiel or Crypto or Giving Legal Rights to AIs or Colonizing Mars?

As for Effective Altruism: a little, but not really. I’ve been in touch with people in the Effective Altruism intellectual orbit pretty often lately, initially because Giving What We Can inspired me to donate a bunch to effective (but non-AI-related) charities. That community tends to talk about the long-term trajectory of technological progress more than most, and that got some of these ideas on my radar, but they have no special claim on them.

As for the rest, no. There’s a confusing mess of communities that have some claim on the history of AI safety as a field, but the basic premise is worth taking seriously regardless of that history.

Where does this leave us?

The details here are debatable, and I don’t claim to have a completely clear picture of these concerns. That said, I think the basic conclusions that are motivating me here are pretty intuitive, and that you can reach them from a bunch of angles. Stuart Russell uses the analogy that future AI systems could displace humanity as the most practically competent agents on earth. Occupying that second-place position hasn’t gone very well for the chimpanzees. 

It is straightforwardly dangerous to build an intelligent system that is perceptive and strategic, has superhuman memory, has the ability to copy itself, and has the ability to work much faster than us… at least unless we’re very sure we understand how the system works and what (if anything) it’s trying to do.

What can one do about these concerns?

If you take the above seriously, it seems like we basically have two choices:

  1. Stop all AI/ML research, anywhere in the world, that could get us close to powerful AI. This would require an unprecedented degree of global coordination and surveillance, and it’s not clear that it would be desirable, since it would snuff out much of the positive potential of AI.
    or
  2. Make sure all sufficiently powerful AI systems are built and deployed responsibly. This requires doing three very hard things:
    a. build robust techniques to align powerful AI systems with the values and goals of their operators, so we don’t get accidental catastrophes,
    b. ensure that those techniques are understood and used by any group that could plausibly build sufficiently powerful AI, and
    c. ensure that we’re able to govern the operators of powerful AI systems in a way that makes their actions broadly compatible with democracy and positive for humanity as a whole.

I prefer option 2. This is what the AI safety community is trying to do, but it’s a hard problem, and the couple of hundred researchers who are currently doing this work—with about 0.1% as much funding as is being poured into accelerating AI progress—probably aren’t yet up to the challenge. There’s a lot of room to help out.

For people like me whose skills and experience mostly involve technical research in NLP, I think 2a—alignment research—is the obvious place to look, since it has loads of potentially-important open problems that look like NLP research.

If you’re interested and looking for a job: Of the places doing empirical work on language, I especially recommend recent work from Anthropic (where I’ve been spending time on sabbatical 👋), OpenAI, DeepMind, Redwood Research, and Jacob Steinhardt’s lab at Berkeley (examples in links). (Of course, some of these labs are also plausibly doing some of the most potentially dangerous work in this area, but to the extent they’re hiring people to work on safety, it seems worth making sure they succeed!) The Effective-Altruism-oriented career advice service 80,000 Hours has a good introduction to what it’s like to work in this area.

If you’re interested in getting involved and you could use funding: Like many other groups doing this kind of work, some of our funding comes from Open Philanthropy. They and a few other funders are trying to help grow this field and they’re often willing to fund big projects on these topics. If you’re interested in doing similar work elsewhere, I’d encourage you to look for funding there. I’m happy to try to give advice on this if it’s helpful.

The new lab

I’m setting up a new lab! It’s called the NYU Alignment Research Group. University labs are mostly just labels for clusters of people within existing departments rather than institutions with any weight, but this is meant to be a signal that I—and the ~dozen researchers who signed on to be part of it from the start—are committed to choosing research directions that try to address these risks.

What we’re doing

We’ll be doing empirical work with large language models for the foreseeable future. Beyond that, I expect our plans to evolve over time, but for the moment, we’re working on a slate of exploratory projects on topics like:

  • Concrete alignment strategies inspired by debate, amplification, and recursive reward modeling that attempt to use an AI system’s capabilities—even if they’re initially unreliable or unaligned—to allow us to bootstrap human oversight on difficult problems: These seem to me to be some of the best-vetted strategies for alignment, and while they aren’t a complete solution, they could plausibly be a large part of one if they worked. Despite this, little empirical work has been done so far to test their feasibility.
  • Sandwiching-style experimental protocols, where, roughly, we look for ways to pose artificial alignment challenges in which researchers need to reliably solve tasks using unreliable AI/ML tools that have some knowledge or skill that’s necessary for the task. (A rough sketch of what such a protocol compares appears after this list.)
  • Alignment-relevant properties of generalization in large language models: For example, when will a model that is aggressively fine-tuned to be truthful and calibrated on a simple class of questions also be truthful and calibrated on more difficult questions?
  • Chain-of-thought-style reasoning in large language models: How far can a plain language model’s stated reasoning diverge from its actual behavior on tasks where it reliably succeeds? Are there fine-tuning strategies that meaningfully constrain this divergence?
  • Looking for additional ways to better understand (and communicate) how hard this problem is likely to be.
    • Potentially including benchmarking and competition-building work, like the inverse scaling prize that several of us helped with.
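
To give a rough sense of what a sandwiching-style protocol compares (this is a sketch of the general idea, with entirely hypothetical names, not an actual experimental design from the group): the question is whether non-expert participants, assisted by an unreliable model, can approach the accuracy of unassisted experts on the same tasks.

```python
# Hypothetical sketch of what a sandwiching-style evaluation measures.
# All names are placeholders; this is not a real experimental design.
from dataclasses import dataclass
from typing import Callable, List, Tuple

Task = Tuple[str, str]  # (question, gold answer)

@dataclass
class SandwichResult:
    non_expert_alone: float       # non-experts working without help
    non_expert_with_model: float  # non-experts who may consult the unreliable model
    expert_ceiling: float         # unassisted domain experts: the target to approach

def accuracy(answer_fn: Callable[[str], str], tasks: List[Task]) -> float:
    return sum(answer_fn(q).strip() == gold for q, gold in tasks) / len(tasks)

def run_sandwich(tasks: List[Task],
                 non_expert: Callable[[str], str],
                 assisted_non_expert: Callable[[str], str],
                 expert: Callable[[str], str]) -> SandwichResult:
    return SandwichResult(
        non_expert_alone=accuracy(non_expert, tasks),
        non_expert_with_model=accuracy(assisted_non_expert, tasks),
        expert_ceiling=accuracy(expert, tasks),
    )
```

A protocol counts as promising to the extent that the assisted condition closes most of the gap between the unassisted non-experts and the expert ceiling, even when the model itself is unreliable on the task.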

Initial participants

Here’s the current group. Of course, not everyone here will endorse everything I said above. I’m excited to work with them!

Collaborating PI:

Outside advisor: 

Research scientists:

PhD students: 

If you want to join us, we’ll probably hire another couple of PhD students in the upcoming application cycle, and we’ll probably put out another broad open call for non-student roles (junior researchers, research engineers, and postdocs) at least once next year.