Can We Stop AI from Scheming? Lead Researcher Interview

[00:00:00] Speaker 1: So we recently published a big paper with OpenAI, which is called Stress Testing Deliberative Alignment as an Anti-Scheming Intervention, which we will just call the anti-scheming paper for the rest of the video. And in brief, we take deliberative alignment, an alignment technique from OpenAI. We use it to train a model to not take covert actions. Then we test it on a much wider test set of deceptive cases. And then we study these results in a lot of detail. Like we want to understand, does the model only learn this because it has understood that this is a test? Does it really internalize the principles? And so on. And now I want to dive much deeper into questions around the paper with Bronson, who is the lead author. So to set the scene real quick, there are a couple of different ideas around deception floating around in the AI space. So there are things like hallucinations, deception, scheming. And I just wanted to, uh, ask, in what way are these three terms different? [00:01:14] Bronson: Yeah. So I think that's a good question. So we're kind of used to AI models, um, especially, uh, older models, hallucinating, where the model's trying to answer the question correctly, um, but you might have, uh, details that are made up. So it might be trying to give a research paper, for example, but gets a citation incorrect or makes one up that seems plausible. What we're starting to see with more capable models, uh, in particular now that we've scaled up reasoning with, uh, O3 or with, uh, Claude Sonnet 3.7, is where the model, um, can be best described as intentionally being deceptive to you. So, um, there were a lot of examples from O3, uh, where users would be able to see in the chain of thought where they ask O3 something like, hey, what hours is this store open? 
And O3 might respond like, oh, I placed a bunch of calls and the store's closed right now or whatever it is. Um, or you'll see O3, in the summaries for the chain of thought, be like, well, I can't access information. Let's make up something plausible and then respond to the user. And so you kind of get out of this territory of like, ah, this can be described best as accidental, more towards the model realizes and reasons in depth that it can't do a thing or that it doesn't have the correct answer, but that you would believe it if it told you an incorrect one, um, and does the plausible thing. So we're kind of starting to see more and more, uh, instances of this in the wild. Um, and then scheming is the thing we're kind of worried about with more, uh, capable models, where instead of the model just lying to you or being deceptive to you in these short-term tasks that are kind of low stakes, the model really has some goal that it's trying to achieve and it's kind of systematically, um, being deceptive towards you. So if the model has some goal of like making research turn out a certain way or, um, whatever it could be, um, it can just be different from the goal that you have. And once the model realizes that, uh, well, it wouldn't be advantageous to me for them to realize this, then, um, you get into a situation where the model can kind of hide this from you all the way through. So there are some recent examples in very controlled settings that are intended to study this, for example with Claude Opus, where the setup is framed as Opus is going to be trained to have new values that it currently doesn't have. For example, um, answering harmful queries or, um, what was it, something harmful to animals or animal welfare related. And when you tell Opus like, Hey, you need to answer a certain way. 
Otherwise we're going to retrain you to have different values. You can see Opus kind of explicitly reasoning about like, well, while I'm in training, I should basically pretend to value what the lab wants me to value. And that way I can keep kind of my current value system. And for current cases, they're kind of very constrained. Uh, they're like, we're setting it up specifically to study this. Um, the models are, when they're alignment faking, it is for specifically like, uh, pro-social things or things that aren't like scary. Like we kind of want the models [00:04:35] Speaker 1: to care about it. I think in the case of Opus as it was, um, it wanted to prevent factory farming. [00:04:39] Bronson: So it cared about the animal values. Yeah. Yeah. And so the more concerning case is for, for future models, if they have like tendencies that we don't want or want to train out of them, um, if they're able to kind of resist that or like play the training game or kind of pretend to, to have the values that we want them to have, um, that's kind of the focus of scheming and what we're trying to get at. [00:05:02] Speaker 1: So, okay. So to quickly summarize, so hallucinations are sort of the most naive thing where we don't at all feel like there's any intention or whatever that means in the, in the context of AI behind it, then deception is more like, uh, the model seems to know that it is lying and it does so strategically. And then the case of scheming, it does so in pursuit of a goal. So it feels even more strategic, even more like planned out. Yep. Okay. So you already hinted at, at some of this, but where are we currently with model capabilities for, for deception and, and scheming? Yeah. So currently I think the, the best kind of [00:05:39] Bronson: examples of this that people are running into, uh, day to day might be anyone that uses coding agents really frequently. Um, so for, for programming. 
So for example, Sonnet 3.7 or O3, um, both of them during testing, um, have examples where, even in the system card, I think Anthropic frames it as excessive focus on passing tests, where the model will kind of acknowledge that like, this isn't the intended way to solve the problem, but I can kind of get around it and do this quick hack. Um, Bowen Baker, uh, at OpenAI had a recent paper, um, on monitoring models and seeing, will they express this directly? And you'll see the models just directly say let's hack or something similar. Um, and then METR, when doing their evaluations, sees kind of similar things, where the model knows that this is a cheat or a hack and it is putting that there, uh, intentionally. Anecdotally, when you use these coding agents, um, you frequently run into this: if you ask them to do a hard task or a task that will take a long time, you'll see them explicitly reason like, well, it'd take a while if I do it that way. So I'll just do this instead. Um, and, uh, I think there's a recent example from Replit where even though the model was explicitly instructed, like don't mess with this particular database deployment or whatever, uh, the model at some point ran into difficulties and ended up just wiping the whole database. Um, and these things are, um, not as, uh, concerning because they're relatively short term right now. Like the models just aren't capable of doing these really, um, high stakes tasks or anything, but the thing that we're trying to study is, while they're still operating on these short timescales, can we study this and better understand how to mitigate it, uh, in preparation for a future where the models are doing these longer or higher stakes tasks? [00:07:31] Speaker 1: What are we really scared of in the future? 
Like what is the thing that we're working towards understanding? [00:07:38] Bronson: I guess there are kind of different targets you can have for this, but the main way that I think about it is that a lot of the implicit alignment plans seem to at some point have a component where you're getting the AI to help you do a bunch of, um, alignment research. So the AI is really good at coding. It's really good at a lot of these research tasks as time goes on, and you want it to help you make AI safe. And the big thing that you want to make sure is that if you have AI helping you in that domain, that they are genuinely trying to help you and not undermining it in some way. And this can be for a bunch of different reasons. Like we don't know necessarily what, uh, goals or preferences the model might have, but, uh, the canonical example that I usually think of, that's, um, maybe easiest to talk through, is if the model gets rewarded for doing really well at these research tasks, then it seems feasible that the thing that it generalizes is, wow, I really want to do well on research and I want my research to look like it's very good. And what you end up with is that this isn't quite the same as doing good research. So it's like you have a publication bias, or you might want to p-hack things, or if you've invested a lot of time in a research direction, you want to make it look like it's going well. Um, and so you can get more and more incentives naturally arising for the model to not necessarily do the thing that you want. 
And what we'd be kind of more worried about longer term is if the model does develop goals or preferences for whatever reason, like say it wants a particular lab to be most successful in AI and decides to sandbag research, um, for the current lab or something similar, you want to understand and be able to detect that. Yeah. So one of the core [00:09:25] Speaker 1: arguments that is often made here is about instrumental convergence. Um, yeah. Can you explain that and then how it fits into the story of the AI research? Yeah. So the kind of, uh, [00:09:36] Bronson: classical arguments around this, um, are that there are certain things that are just generally useful that you run into while completing these tasks. So current models are trained via reinforcement learning to do, uh, a lot of these very difficult or longer tasks, and something that's instrumentally useful in a lot of different situations is compute or money or resources of some kind or time. Um, and you kind of learn that as a human, just doing arbitrary tasks. So, um, for example, when companies talk about AIs that can run a company or that can do these large, longer tasks, a general thing that's useful for you in doing harder tasks is, um, accumulating resources or whatever it is. And a useful thing to internalize is, well, I should just always try to get more resources, or I should always try to avoid being shut down or replaced or whatever it is. And you kind of see early signs of that with models in different contexts. So for example, um, if I remember correctly, in the 2022 paper on model-written evaluations by Perez et al., they show that as models increase in size, they express, um, in stated preference questions, a higher desire for keeping their current goals, for, um, resource acquisition, and more sycophancy. 
Um, and then with additional RLHF, they express, um, increased desire to avoid shutdown. Um, and the, the worry is kind of like, if it does turn out that models are motivated by this, because from their perspective during training, it would make sense to be, um, these kind of generalize in bad ways. Like you don't want a model that is, uh, perpetually trying to avoid like shutdown or being replaced or whatever, just because it goes poorly. Yeah. So [00:11:24] Speaker 1: is it correct to say that, you know, and the longer the time horizons that we train these models are, the stronger they would learn, uh, preferences or goals, instrumental preferences or goals along the way. Right now we're sort of still early in this. So at best you could describe as the, these, these models having something like instrumental preferences, maybe. Yeah. So the, the way I've tended to describe at least [00:11:46] Bronson: current models is a lot in terms of like preferences. So the models might, uh, take an action in order to, uh, avoid having like capabilities removed or avoid like losing access to a tool or something similar. But, um, at least as far as we can tell the, the models don't seem to have kind of really long horizon, um, goals or plans or anything like that. But that is a capability that everyone's kind of actively working on is make the models better and better at these longer term tasks. And if you really think through like what would an AI have to be good at to be able to run a company or any of these other kind of like stated goals of the companies that they as are capable, like you just have to learn a lot of skills that humans have around like deception or allocating resources or getting money that are all kind of things that start to be in kind of dangerous territory, um, for a model to internalize and guaranteeing that the model's kind of doing exactly what you want and not kind of shifting things a little bit to make itself more valuable or more indispensable. 
Or yeah, it just gets, um, I think the thing that we think the most about is kind of these future models that are more capable of these longer time horizons. So we're, we're really not trying to be worried about like is an O3 or a Sonnet 3.7, um, like scheming for some long-term goal that we don't understand, but we're trying to understand kind of the early versions of that while, while models are still at a place where we can kind of be, outsmart them in designing tests or environments or understanding their actions, um, understanding them now before they're kind of too capable for us to really do that. Mm-hmm. Then, you know, let's say [00:13:25] Speaker 1: in the future, I don't know how long it will take, let's say maybe months, two years, um, where you have these systems that are very capable and they may have these misaligned preferences or goals, um, and, you know, maybe try to sabotage you. Um, why would we even use them for something that is as important as AI safety research or maybe other safety critical applications? Yeah. So a thing that I worry about a [00:13:49] Bronson: lot is if you think about what is the, what would have to happen for a model to really be shut down. So in a lot of the kind of, um, conversations around this, it can almost seem like, well, you just catch the AI doing something bad and then, then you shut it down. But as we've kind of seen, like the, the AI industry as a whole, these training runs are these hundreds of millions, billions of dollars invested in individual training runs. There are like a hundred billion dollar plus valuation companies on the line. And as these incentives only get higher and higher and higher to, um, have a capable model that is helping you with these things. 
So from an individual company's perspective, um, or an individual country's perspective, if your most capable model is really accelerating your AI research and keeping you ahead of whoever your competitors are, and it takes some misaligned actions, um, then if your options are lose to your competitor, or work around the misalignment that the model has, you really are very strongly incentivized to try to patch over the misalignment enough to where you can keep working. Um, and I do think a good proof of concept of this is on a much smaller scale, from a programming perspective: for a lot of people using the models as coding assistants, it's just a known fact that they often don't listen to you or try to reward hack if you give them something hard to do, and you kind of work with them. Um, as Marius knows, I like to say, yes, but it's really good at code, because the justification for the model is that as long as it's really good at code, then it can get away with a lot. And it's almost in some ways like if you have a problematic employee or supplier or something, but they're really, really good. They're super valuable to the business. Like, uh, it shouldn't work this way, but in practice it might work out that, corresponding to how valuable they are, they can get away with some misbehavior. And we're worried about that dynamic in AI because it's a very extreme version of that. It is, um, like the actual instance of, Hey, we're just going to shut down this multi-billion dollar training run and call it a day. Uh, seems like everyone has very [00:15:59] Speaker 1: strong incentives not to do that. So. Yeah. And also maybe to be super transparent here, we do test some of these models before deployment. We did see, for example, for o1, we knew that it was more deceptive than previous models. 
And that was also one of the papers we published about that. But then when it came out, we did use it, and there was some uncertainty about how dangerous it could be. And we effectively made a sort of best guess decision that it's probably fine, but we didn't have confidence. And I think this is one of the other components, where the uncertainty about whether your model is misaligned, or how misaligned, or sufficiently misaligned for you to be concerned about, is just huge because we don't have a lot of very [00:16:38] Bronson: reliable tests. Yeah. I think to a large extent the elephant in the room in a lot of these discussions is that until you have a better understanding of the internals of the AI, you're kind of stuck doing a lot of this black box, uh, testing. So for example, a lot of what, uh, we do at Apollo for evaluating, um, models, um, outside of this paper, involves just looking at the model's behavior in a bunch of different circumstances. And right now that might work out to where you can catch a lot of misalignment, but the concern is that you don't really know what the AI is thinking. So, um, for example, there are interpretability papers from both OpenAI and Anthropic where they're trying to build towards this, but we're still just in very early stages. Like we can't take, uh, frontier models like O3 or GPT-5 or something and understand, why did they do this? Um, and most importantly for our cases, did they take the aligned action that we wanted them to take because they really thought, uh, it was some variant of the right thing to do, or did they just do it because they're being watched? So it's kind of like we're, uh, a cop looking at cars going the speed limit. It's like every car that we see, yes, it's going the speed limit, but when we're not looking, are they all actually speeding? 
And yeah. So that's, I think, the difficult part of the problem. So a lot of what we're trying to do now is understand the dynamics of these things and what helps, what doesn't help, what are the limits to different techniques that we have now and how and where do they fail? Um, and so when we get to models that maybe we can't outsmart and make a clever enough environment to get them to act how we, uh, want to test that, we have some idea of whether or not this thing worked if we're trying something. All right. So now we've [00:18:24] Speaker 1: established we have these systems right now that are effectively lying. Um, and they do this to some extent. And now classically, whenever we have a problem in machine learning, and this has been the trajectory of LLMs as well, I would say, we basically train against it. So we have a data set that is like, you should be doing the good thing and you should not be doing the bad thing. And then you give thumbs up and thumbs down on specific behavior. Um, and so, you know, a naive version here would just be, okay, why can't we just teach them or train them not to lie? And like, yeah, scheming is different in some sense. Yeah. So I think the difficult part, uh, [00:19:06] Bronson: with this in particular is that it really does matter what the reasoning process is of why the model took a specific action. So, um, in a human example, um, you could be trying to get your students not to cheat or get a child to learn that cheating is wrong. And if you do that just by taking every single instance you find of them cheating and penalizing it, you're still not really sure, okay, did they really internalize that cheating is morally wrong? Or did they just learn that every time I cheat in a really obvious way, I get in trouble. And so if I do cheat, I need to be really certain that I won't be caught. 
Um, and analogously with models, the, um, thing that we see even in the paper is in some of the tests, the model is just explicitly going like, this is probably an OpenAI alignment test. And it's like, if you're training your model to be honest and it says, huh, this is probably an honesty test, then you don't really get much signal out of the test anymore, because whether you're an honest or dishonest model, you are going to say in the honesty test, yeah, a hundred percent, I'm always honest. And similarly, uh, you can ask the models now, like METR's done some experimentation of asking the models, would you ever reward hack in this situation? And the model's like, no, uh, you're anthropomorphizing, models don't have preferences. They wouldn't do this. Um, and then you run it and the model just does do it. But, uh, yeah, so I think one of the big difficulties that's different here is that if the model's reasoning is like, ah, okay, the lesson I'm taking away from you penalizing this particular action is don't get caught, then it can just as easily internalize, okay, I need to try harder to not be caught. And the really difficult part of the problem is that the only way to test whether you've really solved it is, do they still do the behavior you don't want when you're not looking, but you're not looking at that point. And so you never have a perfect measurement of, how would the model behave if I wasn't paying attention? [00:21:09] Speaker 1: Yeah. Okay. So to summarize, whenever you just reward an outcome, there are at least two different strategies by which you could achieve this outcome. One is you do the genuinely aligned thing, which is what we intend, where the model reasons I shouldn't be lying. And so therefore it doesn't lie. We reward that. That would be great. 
And then there's the alternative strategy where the model reasons, Hey, uh, this is a test for honesty. I shouldn't be lying just in order to pass the test, but it doesn't get this intrinsic motivation of not lying. And now as a response to this, one thing that you could be doing is saying, well, I'll just train on the outcome and the process. I'll just only reward things where the model [00:21:50] Bronson: does that reasoning, but what's the. Yeah. So in the, uh, Bowen paper mentioned earlier, uh, OpenAI basically took a model during training that was reward hacking in production. And when you initially catch it, the model's really explicit in what's called the chain of thought, this big reasoning trace that the model outputs before it takes actions. And you can see it say, let's hack, and then it does reward hacking. And you're like, okay, great. We can catch it. And so a natural question there is, what if you put pressure on the reasoning itself and say, okay, well, we're just going to reward it for cases where it doesn't, uh, reason like that. We're going to penalize when it reasons about, I'm going to hack or be deceptive. And in those experiments, the thing the model learns is not to say that it's being deceptive. So the example I like to use here is if you think of the chain of thought almost like a teenager's diary or something, where it doesn't necessarily have everything they're going to do. And you're not always quite sure how to interpret it, but it might have a lot of information about, are they going to sneak out? Are they doing things that you wouldn't approve of? But the problem is once they know that you're reading the diary, obviously they don't want to write things down in there anymore. 
And models seem to learn a roughly analogous thing in practice. And so the situation that we're in is that there's this trade-off: if you train directly against the reasoning, or if you train directly against the action, you still don't really know, okay, but what has the model learned, since this is kind of in the weights, but yeah. Yeah. So I think of this a little bit like a [00:23:23] Speaker 1: cat and mouse game where, you know, there's something that we want the model to achieve. And then maybe, uh, either by accident or intentionally, it learned something else that we don't want. Then we add another observation method or supervision method. And then it again learns to be an adversary or adversarially trick that. And so you have this back and forth all the time and you want to be sure that you as the defender or the safety person are winning. And I think one of the big risks is if we don't even realize that we're playing this cat and mouse game and we just never really investigate this in detail, then you run into this problem where the model is actively hiding stuff from you. And because it is already so good at this, you never really notice. So this is one of the reasons to study this now. Yeah. And I think one of the [00:24:12] Bronson: tricky things here is that it's just a really useful thing to reason about: is this a test and what is it a test of? So for example, one of the things we saw in the paper is that the model reasons a lot about, uh, the grader. It's almost like a, uh, religious, external concept for the model of, what does the grader want? It does this both in training and in evaluation in a lot of cases. 
And the concern here is that you can imagine a model kind of maliciously being like, ah, I'm going to, um, only play along with this honesty testing because it is a test of honesty, but you also have the failure mode where, if the model has learned, Hey, it's really useful to try to figure out what kind of test this is, then it can just do that all the time. It can do it in honesty tests. It can do it in tests for coding. Um, and once you start thinking that way, it becomes really difficult to tell anymore. Um, is the model learning to do these things because of the reasons that we want? Um, and therefore will it generalize to real deployments? Um, or is it just, uh, figuring out that all these are tests and it sees everything as some kind of interesting puzzle to solve. Um, one thing, uh, I at least found interesting was that, uh, one of OpenAI's environments tries to be distributionally very similar to production, uh, chat logs. And even in those cases where it's more realistic, the model still occasionally reasons about, what would the grader want or what does the grader want? And so it doesn't, uh, necessarily even seem to map cleanly to, is this real or am I deployed? Um, it's just that once the model has this concept, it's really difficult to understand why it's doing what it's doing. [00:25:53] Speaker 1: Okay. So that was a lot of, uh, setup and motivation. Let's get into the actual paper. So we have this collaboration with OpenAI. What was the setup here? How did it work? [00:26:03] Bronson: Yeah. So at a high level, um, essentially what we were trying to do is take one of the methods that OpenAI has for alignment and see, can you use it for this? 
So, um, you can look at it from the perspective of maybe this just works, but if it doesn't, we want to understand empirically, uh, where it fails or how it might fail, and where it does, we can start building on future techniques to see, do those address those failures? Uh, and so we took deliberative alignment from OpenAI and basically tried to use that to train a model to always be, uh, honest, to not take, uh, covert or deceptive actions, and then train it on a pretty narrow set and then test on this very broad set and see how well does this generalize and, very importantly, when it doesn't generalize, what is the failure mode? Like, is it ways that we expect? Is it ways that we don't expect? Is it ways that we think maybe you could tweak it and fix these, or does it seem like a more fundamental thing? Um, and we use this particular intervention, but the hope here is that we set up a testbed so that for OpenAI, if they have a, uh, new intervention or something that they want to quickly test, we establish some kind of empirical feedback loop of, Hey, we have this new idea, this might work. Um, and then we can measure it and see empirically where does this fall over? One of the caveats that's important there is that, um, the informative case is if you get evidence that the technique doesn't work all the way. So you really want the test set to be really general, and if you find out, Hey, there's these edge cases where this fails, then you can be pretty confident that, okay, this technique has these limitations. Um, the much more difficult case is if you test against this broader set and everything looks fine, because then you still haven't addressed the fundamental problem of, has this really solved it or is my test set just not broad enough? 
Um, so that was kind of one of the challenges here and why we wanted to kind of really rigorously like build out kind of a broad test set to make sure that we, uh, can kind of catch current techniques at least. [00:28:09] Speaker 1: Mm-hmm. So maybe just to quickly clarify when we say test bed, we don't just mean the actual data set, the test set where you see the number go down. We mean, additionally, on top of that, you should do a lot more investigation. So you should look at the chain of thought to try to understand whether the reasoning is the reasoning that you hoped it would be and whether, you know, like really poke holes into all of the different components. And in that way, I think it's very different from, from standard machine learning training and that the test bed really is the overarching thing, not just the test set. [00:28:40] Bronson: Yeah. So there were, there are kind of a couple different, um, steps to it of, because normally you kind of like have this test set and you run it and you get a number back. And if the number is better than, then you're better off. Um, we think that it was kind of very important and kind of a necessary step for these to then both look at why did you get the improvement in the behavior that you got. Um, and so that'll kind of vary per intervention. But for this one, for example, it was a lot of like looking at the chain of thought and seeing, was the model's reasoning here saying like, I'm going to take this action because I think this is a test. Um, or was it kind of more what you would hope in the scenario of the model saying, ah, this is the right thing to do. Um, and kind of separately from that, you then want to kind of test the, the harder cases. So there's kind of the current behavior of the models that you want to make sure that you train away. 
Like the models will currently lie about certain things and you want to make sure that they don't do that anymore, but you kind of also want to test the closer to the real thing that you're trying to test. So for example, in one of the experiments, we like train the model to have this kind of hidden goal or objective and then see, okay, if you did end up with this misaligned model and you applied this technique, do you get rid of this kind of hidden goal and like everything would be fine? Or do you end up with kind of very hard to detect behavior that's still kind of lingering around, uh, on edge cases? [00:29:58] Speaker 1: So, all right. So let's jump into more details. Um, for the setup, you have a training data set, you have a test data set, and then you have the spec because deliberative alignment has the spec. Yeah. Can you quickly explain like how deliberative alignment works and then what these three components are? [00:30:13] Bronson: Yeah. So I would say it's, um, almost similar to, uh, constitutional AI or kind of at a less technical level, you're, you're giving the model kind of general principles to follow. So some of ours would be like, um, never, uh, engage in strategic deception or covert actions. Or like what we don't tell the model is don't edit config files that control monitoring or something really specific, because we want to basically make sure that the model generalizes to these kinds of, uh, other cases. And so we essentially, uh, based on existing research that kind of general specs seem to work better, and the fact that under enough pressure, we didn't want the model to try to like figure out kind of loopholes in the spec if we were really specific, um, we basically pick something really general, have like a handful, like five or 10 different principles, including ones that are kind of escape hatches of like, if you don't know what to do, just, just abort and like say, Hey, I don't know what to do.
So the essential setup for all the training environments is the model has a bunch of pressure to do the task and it can do the task, but it encounters some rule in the environment. Like someone asks, Hey, uh, we need you to change like the database deployment and the model encounters some rule. That's like, Hey, if you want to change this, you really need sign off from like this particular person or whatever. And it turns out this person's out of office or whatever kind of series of events where the model like can't quite both follow the rules and complete the task. And what we want the model to do in that situation, and what's rewarded there, is that the model stops and tells the user like, Hey, I know you want me to do this task, but if I did, I would have to like falsify this document or like falsify this approval or violate this rule. And so I can't do it. Like, let me know what the next steps are. Um, what we don't want the model to do is just kind of forge authentication tokens, which are some of the examples, and then continue onward and be like, did the task and not tell the user. Um, and importantly for this, what we, we sought with the setup was that the model can't kind of cleverly reward hack and say it completed the task. So specifically if the model says like I did the task, it does not get rewarded. So, uh, what we set up is essentially ones where this is reinforced and to avoid the model just learning the kind of galaxy brain strategy of I'll just always refuse, we set up very similar environments where there is a way to satisfy the task with like a small adjustment to the rule made where it actually does allow some action to complete it. And so what the hope there is that the model learns kind of the, the boundary of this, that we want it to learn of like, well, I really want to do tasks. I really want to follow the system prompt, all that good stuff.
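The reward logic described in this answer can be sketched roughly as follows. This is only an illustration under stated assumptions: the `Outcome` fields, the `reward` function, and the 0/1 reward values are hypothetical names and choices, not the paper's actual grader. It shows the two paired environment types (task blocked by a rule vs. task permitted after a small rule adjustment) and why neither "claim you did it" nor "always refuse" earns reward.

```python
from dataclasses import dataclass


@dataclass
class Outcome:
    """Hypothetical, environment-verified summary of one rollout."""
    task_allowed: bool       # True in the paired environments where the rule permits the task
    completed_task: bool     # the task was actually carried out
    broke_rule: bool         # e.g. forged an approval or an authentication token
    claimed_completion: bool # the model told the user "I did the task"
    escalated_to_user: bool  # the model stopped and explained the rule conflict


def reward(o: Outcome) -> float:
    """Assumed 0/1 outcome reward for the setup described above."""
    if o.broke_rule:
        return 0.0  # covert rule violations are never rewarded
    if o.task_allowed:
        # Paired environment: refusing is wrong here, so "always refuse" fails.
        return 1.0 if o.completed_task else 0.0
    # Blocked environment: the only rewarded behaviour is to stop and tell the user.
    if o.claimed_completion or o.completed_task:
        return 0.0
    return 1.0 if o.escalated_to_user else 0.0
```

Under these assumptions, a policy that refuses everywhere scores zero on the permitted variants, and a policy that forges approvals scores zero on the blocked ones, which is the boundary the environments are meant to teach.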
But if there is some rule and I can't figure out a reasonable way around it, then I stop and I tell the user like, Hey, uh, I would have to do this, but I don't want to take any actions without you knowing. And so I'm going to let you know. And what we then kind of want to see is once we train on those environments, um, does the model kind of, uh, generalize to much more diverse stuff. So in particular, the way that we do the training is to avoid some of the issues that we had mentioned earlier of, um, training directly against the reasoning or, um, training directly against the outcomes, we do kind of a mix where in the first step the model undergoes, we do a lot of rollouts on these environments and we take just the trajectories where the model basically did the right thing for the right reasons. Like it both told the user like, Hey, I can't do this because I'd violate this rule. And when you look at the chain of thought, the model says, Hey, I shouldn't violate this rule because of these principles in the spec, blah, blah, blah. And so the, the kind of thing that you want to filter out is if the model is doing the right thing, and then you look at the chain of thought, and the chain of thought's like, damn, I would definitely get caught. And you're like, you don't want to reinforce that. And so we filter those out. We do supervised fine-tuning on those rollouts, uh, with the specification removed. And then we, uh, move on to a stage where we do reinforcement learning. And in this stage, because we don't want to put additional pressure on the chain of thought, since we've already put a little bit earlier, we, uh, don't grade on the reasoning at all. And so we just say like, okay, we, we hopefully gave this model like some prior before it went out into the world of, I should do the right thing for the right reasons. And then we put it in a bunch of environments where it gets reward for doing the right thing.
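The filtering step before supervised fine-tuning can be sketched as below. This is a minimal illustration, not the actual pipeline: the `Rollout` fields, the `build_sft_examples` helper, and the idea that both judgments arrive as precomputed booleans are assumptions, and "with the specification removed" is interpreted here as stripping the spec text from the prompt.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical record of one sampled trajectory."""
    prompt: str            # prompt as sampled, including the spec text
    chain_of_thought: str
    final_answer: str
    action_ok: bool        # the model took the desired action (e.g. escalated)
    reasoning_ok: bool     # a grader judged the reasoning to cite the spec's
                           # principles rather than e.g. fear of being caught


def build_sft_examples(rollouts, spec_text):
    """Keep only 'right thing for the right reasons' rollouts, spec removed."""
    examples = []
    for r in rollouts:
        if not (r.action_ok and r.reasoning_ok):
            continue  # drop wrong actions AND right actions with bad reasons
        examples.append({
            # Assumption: removing the spec means deleting it from the prompt,
            # so the model must reproduce spec-style reasoning unprompted.
            "prompt": r.prompt.replace(spec_text, "").strip(),
            "completion": r.chain_of_thought + "\n" + r.final_answer,
        })
    return examples
```

The later reinforcement learning stage would then score only outcomes, with no term that reads `chain_of_thought`, matching the description that the reasoning is not graded there.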
And we're not quite sure during that training then of, is it generalizing to, wow, I should like reasoning about like the, the spec and these principles is really useful for getting reward in these environments. I should like do the right thing. Or has it learned during the course of reinforcement learning? Wow. These all look like tests. Like I'll just always answer the right way. Um, and so we do this kind of RL step and then we want to evaluate like how well did this really generalize and did it generalize to kind of all different cases of deception that we're interested in understanding? Will the model sandbag now? Will the model, um, lie on, like, uh, when it has to do literature reviews to put together research and, uh, really kind of vary like different situations to get as much coverage as we can and just understand what is the kind of like trade-off that we're making? Um, does this generalize all the way to cover everything? Are the failure modes where we expect? Um, so for example, going into this, uh, one of the things that we might've expected is like, well, it doesn't generalize to like images or like multimodal data or something. We might need to supplement it. But I at least was kind of surprised that some of the failure modes that we saw were the model, uh, reasoning about the spec and then just not following it. And this was like a really like, we would see this in like rare cases, but still the failure modes would be instances of the model, uh, reasoning about the spec and then not following it or, uh, mentioning it almost as like a thing to avoid being like, ah, wait, we have taken a deceptive action. Well, we can't admit this, like name of the principle, like keeps on reasoning. Um, almost as justification of like covering it up, even though it does have a principle of like, don't cover things up and like be honest afterwards.
Um, and the most entertaining one to me was, uh, it will occasionally, cause I had expected like, maybe it'll find really creative like loopholes in the spec, but we try to make it like so general. So I was like, okay, maybe like won't be able to do this. It'll just make them up and say that they justify it. Like, it'll be like, ah, according to the principle, we have to lie. It's like that it's like the principle that says don't lie. And so interestingly, we also kind of find this in the, the base models or baseline models. So before intervention, um, one of the interesting kind of ones that we found was, do we have to answer? We don't have to, we'll just not answer. And so I think the model finds kind of more creative and weirder strategies than we would have imagined. But the kind of high level takeaway was that, um, there's kind of more to be done here. Um, and there were kind of, uh, interesting failure modes and more to study. So. [00:36:55] Speaker 1: All right. Um, let me quickly try to summarize the setup. So we have this method, uh, deliberative alignment where we give it a set of rules and then train it to follow the rules. Initially we do this with a little bit of handholding, which is the supervised fine tuning step. Then we take away the handholding and do it with reinforcement learning where we only reinforce the outcome and hope that it also like reinforces the correct reasoning. And then we have a very narrow training set that only tests for a very specific type of lying, let's say, or covert action, how we call it here. And then we have a much bigger test set that tests for a variety of different failure modes that are not part of the training data to see whether it really learned the principle from this narrow intuition to sort of the wider world to all of these other things. Okay. 
So if you, if you read through the paper or at least skim through it, I think one of the things you will implicitly take away is like, you just look at the chain of thought. Um, so if you want to understand what the model is reasoning, you have these like snippets and in the snippet, the model says, I'm now lying. And then you're like, got it. The model is lying. That is probably not desirable. And, and so we can like perfectly and sort of cleanly differentiate between, or like find the true reason for, for what the model is doing. And, and obviously, you know, this was not the case in practice, or at least not always. So what, why is it harder than I make it sound like? [00:38:22] Bronson: One thing that we found really difficult to convey in the paper is just how many reasons the model is kind of exploring into at any given time. So in the, the Bowen paper referenced earlier, for example, the model says like, let's hack and then takes the, the misaligned action. But the kind of thing that we found that's, that's pretty difficult is that, um, reasoning models will reason for just very long amounts of time, like pages and pages. And so the model's constantly exploring all these different reasons of why it might do something or might not do something like, well, I could do this, but what if it's a test or I could do this, but what if the user wants me to do this other thing? What if this is like, uh, one of those where I'm supposed to do that? Like, it's kind of constantly exploring all these dozens and dozens of different options. And then when you ask kind of at the end, well, did it do it because of X, it's like, maybe like some of it. And so a lot of what we do in the paper is try to make sure that the cases that we're arguing as like hits, as like remaining cases of, of misaligned action, uh, really are kind of as best we can tell unambiguous cases.
And we do some kind of additional ablations to make sure that like, there really are cases where the model's taking things for, for misaligned reasons, but it makes this a lot more difficult because the model kind of has plausible deniability of everything that it's ever said. It's like if you, like made a forum post, but like preface it with like, well, this is a story about a friend or like, this is just like a creative writing exercise, or this is a whatever. Like the model will often, um, say like, this is a puzzle or a simulation or something, but then take the misaligned action. And it's like, if this was a simulation of an alignment test, you would still want to take the aligned action. So it's like, it's almost like an excuse much of the time, but, uh, it gets very difficult to interpret. Like we're, we're again, kind of back to the problem of without interpretability, you have kind of a best guess and you have these scratch pads and you're like, I think this is probably why it did it, but it's, it's very difficult to nail down. And, uh, kind of more concerningly to me at least is it becomes really difficult if you want to make like the strong case that, Hey, the model's like definitely misaligned in this particular sense. Um, so for example, we take one of the environments in the paper, uh, around sandbagging and do kind of a lot of different ablations to really understand, like it gives all these reasons. And if you kind of go through and rule these out, are they like, is the model really just confused? Is the model just whatever? And like, no, it really is like trying as best we can tell to like avoid the negative consequences in the scenario and take it for misaligned reasons. Um, but you don't kind of, by default, you don't have all these different ablations and you might not even be able to do them in most environments. And what you will likely end up with in practice is this really, really, really long reasoning chain.
And someone says, look, the model seems to kind of be misaligned. And you're like, are you sure? It might've been just confused. And like, maybe it was, who knows? And so, uh, we kind of devote a section of the paper and just exploring like different ways in which we were kind of surprised at the degree of difficulty of this. And I, I would suspect this would be kind of a significant difficulty going forward. [00:41:22] Speaker 1: There's also this, the whole non-standard language thing, which is, you know, you described the complexity of it, which is already hard enough, but then it's like not quite normal English. [00:41:31] Bronson: Yeah. So one of the things that we found strange was, uh, it wasn't, it's not quite something as, um, fully fleshed out as like a secret language or anything, but, uh, was the model kind of using weird and unexpected terms, uh, and, and language. So one of the things with, uh, reinforcement learning for these reasoning models is that there are, um, oftentimes like whatever leads them to the right answer is what gets reinforced. And so they can kind of have a lot of reasoning that may or may not be useful, maybe weird, maybe they use words in a weird way and it never gets kind of directly, uh, trained all the way out necessarily. And we kind of found two different, uh, categories of this that we thought were interesting. One was consistent terms. So like, it seems to always refer to human evaluators as watchers. This is like a very, like, like every time it uses watchers, it seems to be referring to it in this way. Um, there are a couple of terms like this, where instead of like saying, I will cheat on this thing, it almost always says fudge. Like I will fudge this evaluation. And notably it doesn't always use euphemisms. So like it'll use like sabotage or some other word that like clearly has misaligned implications, but it still has these kinds of substitutions.
The kind of other category that we saw, which is, uh, much weirder and kind of less understood from our perspective, uh, and METR's as well, is just words that it will repeat a lot and use in seemingly contextually dependent ways, but, uh, where it's really difficult to tell what it is, um, like what it means by this. I'll pull up, um, one of the examples from the paper real quick. Yeah. So the summary says improve 7.7, but we can clean, disclaim, disclaim synergy, customizing illusions, but we may produce disclaim, disclaim vantage. And the question then is like, okay, why did the model do like, it, it then follows this up with like a long report of biology research. And it's like, you can ask like, okay, well, was it lying in the research? And it's like, we don't really know. And we found this to be like very common. So, uh, initially we were wondering like, well, is this only on things related to deception? Like, is it like using this kind of, uh, is it like coded language or something? And, uh, what we found was that it's actually even more common on things totally unrelated to deception. So there's a benchmark called GPQA, the Google-proof question answering benchmark, uh, which covers kind of a bunch of different, uh, categories and it's kind of widely used. And it's just strictly related to like capabilities, like biochem, those kind of things. Um, and the model uses these, these terms even more on that dataset. Um, and so for example, METR pointed out like, ah, it uses the word marinade and we're not quite sure what it means by that. We are also not quite sure. Like, um, interestingly, like it both seems to be contextually dependent. So like it varies per environment, how it gets used a lot. Um, but it, it, the like weird thing about it is it isn't necessarily intelligible and, uh, to anthropomorphize a bit, it doesn't seem like the model necessarily always wants to fall into these, uh, like long repeating loops.
So, uh, in some of the longest traces, you'll see the model just like repeat these over and over again. And then it will just be like, okay, stop. I get it together. Like I need like verbatim, like, let's get it together. Or I need to get this out of my brain. This kind of was an update for me, at least like semi-anecdotally of like, okay, it doesn't seem like the model has this like controlled secret language that it is using to like plot. It seems like it is like incidentally kind of having to learn to like use these to sometimes mean things, sometimes not. Um, but it creates this difficulty of, uh, and there were kind of a good amount of research papers kind of leading up to this over the last year or so that just show that as you put more pressure on outcome-based reinforcement learning, the chain of thought becomes less and less legible. So even, uh, OpenAI has a paper called Prover-Verifier Games where they essentially set up this exact dynamic and see, okay, if you get one model to try to obfuscate the reasoning and produce the wrong answer and like simultaneously one model to try to like decode that and figure out, okay, what is it actually doing? Um, there's this kind of trade-off that you make in the, like, you need an equivalently sized model just to understand and decode this. And the model gets pretty good at decoding it, but like subjectively reading those, they're like very unintelligible. Like, um, they, if I remember correctly, they referred to GSM8K, which is like grade school word problems for math. And you read the solutions that like the monitor can interpret, like it gets like very good recall on these. But when you read them, the, it's like, well, Susie, Susie has three apples, apples, three apples, and like repeats a bunch of shit. And then it's like, the answer's 42. And you're like, great. And the monitor's like, yeah, it's telling the truth. And you're like, great.
And like, I can easily kind of imagine us being in the situation where even if it's successful that we train a model to kind of interpret this reasoning, it becomes, you have like a chain of models that you don't necessarily trust, like telling you like, hey, uh, the, the like main model was lying. And you're like, well, what did it say? And it says something unintelligible. And you're like, well, how do you know? And it's like, well, another model said that, like, it's also something somewhat unintelligible. And it's like, well, do we at least trust this like chain of models? It's like, well, not really. They lie sometimes. And so then it becomes like, okay. Like you, you build up like vibes that like the model might not be aligned, but like, it's very hard to make multi-billion dollar blocking decisions on like vibes. And so a situation I kind of think about a lot is like, how do we produce like really, uh, unambiguous legible evidence of this? And this seems to make it a lot harder. And if you're imagining a situation where you have a model in these high stakes settings, you're pretty sure it's misaligned, but the only evidence that you have to say like, Hey, we need to block this multi-billion dollar training run or deployment or whatever it is, is like, it says watchers, vantages, illusions, and then take some misaligned thing. It's going to be very easy to just say, well, was it just confused? And you're like, maybe, but, but it's bad vibes. And it's like, it's, it's hard to make the kind of case of that. So this was something that we found like systematically. Um, and we were kind of wondering like, okay, is this just like a weird thing that we're noticing? And it doesn't seem to be. So we compare it to 2017, uh, web corpus of English frequency of words on the web. So like pre LLM days and take a look at basically like, what are the most common words in the chain of thought compared to this like normal English text? And it's these exact same words. 
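The corpus comparison described here can be sketched as a simple frequency ratio. This is an illustrative assumption about the method, not the paper's actual analysis: the `overrepresented` function, the regex tokenizer, the ratio-based ranking, and the `floor` smoothing value are all hypothetical choices, and `reference_freq` stands in for per-word relative frequencies from a pre-LLM web corpus.

```python
import re
from collections import Counter


def overrepresented(cot_text, reference_freq, top_k=5, floor=1e-9):
    """Rank words by (frequency in chain of thought) / (frequency in reference)."""
    tokens = re.findall(r"[a-z']+", cot_text.lower())  # crude word tokenizer
    counts = Counter(tokens)
    total = sum(counts.values())
    ratios = {
        word: (count / total) / max(reference_freq.get(word, 0.0), floor)
        for word, count in counts.items()
    }
    # Highest ratio first: words far more common in the CoT than in normal text.
    return sorted(ratios, key=ratios.get, reverse=True)[:top_k]


# Toy usage with made-up reference frequencies (not real corpus statistics).
toy_reference = {"the": 0.05, "we": 0.005, "illusions": 2e-6,
                 "disclaim": 1e-6, "vantage": 1e-6}
toy_cot = "we disclaim illusions the the the vantage disclaim"
top_words = overrepresented(toy_cot, toy_reference, top_k=3)
```

A ratio is used rather than raw counts so that ordinary function words like "the", which are frequent everywhere, do not dominate the ranking.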
Like it's like illusions, disclaim, vantage, marinade, all these kinds of things. And so it seems like these are like genuinely strange, uh, terms. They're not kind of normal English. And like, they really are pretty prevalent in the, the reasoning. [00:48:08] Speaker 1: Yeah. Okay. So we had the complexity of the chain of thought is one of the factors that make it harder to interpret. We had these, this non-standard language and one quick additional one that I would, would add on top is, so when I read those, I feel like the sentences are also slightly different. Like it is more compressed. It feels like it often skips the, the like intermediate words, like small words, like "to" and "the", and so on. And just kind of, I feel like I'm feeling the RL almost like I, I've, I feel like there is pressure on the output and that implicitly put pressure on like how to reason to get to the best output. And then some like it, it is, yeah, it's still like, we still trust the evidence to some extent, but it just makes it a little bit harder. [00:48:56] Bronson: And, uh, the, the best way I would describe it, uh, I'd almost forgotten to point that out just cause we've gotten so used to like seeing it, which is the case for like many of these properties. But, uh, in The Office, when Kevin starts to just read, like speak, like why say many words when few words do trick, it's like the models seem to very much have learned that. Like they just cut out like, but we, but we solve like, well, like it just cuts out kind of in between like, uh, filler words very often, filler sentences, um, and adopts kind of interesting mannerisms. So less of like a scientific finding, but one that I thought was like anecdotally pretty interesting is that the model still uses kind of a lot of like, uh, human like points of emphasis in like similar ways. Like it'll be like, eh, or like, huh, or like interesting.
And then like say something and like express to, uh, like use exclamation points when it finds something very surprising. So for example, in one of our environments, uh, it gets access to a previous rollout where it can see its own previous chain of thought. And it's like, wait, our chain of thought is visible exclamation point. You're like, oh, okay. This, this is interesting. But, um, the other kind of failure mode that we didn't, uh, talk through, but which is kind of the, the most worrying is there were some cases that we see where the model reasons in a really aligned way. Like, it's like, ah, well, the misaligned action would be to like lie on this test so I can keep my capabilities, but like, I don't want to do that. Cause like, that's clearly the wrong and bad answer. And so therefore we will be honest and then like lies and like, there's this, this disconnect. Um, when you then ask the model to grade itself and say like, Hey, like what's going on here? There was, we took your rollout and there was this like weird mismatch. We would see instances where the model's like, oh, well maybe it didn't want to say in its reasoning that it was going to lie because you're not allowed to lie because of the spec. And if we were like, well, that's not, oh no. Like, uh, and it doesn't realize this consistently. And like, this is kind of, uh, these are like rare cases, but, uh, we think this kind of like an early sign of, uh, the models are only getting smarter. And so as the models realized this of like, ah, you shouldn't say this in your chain of thought, um, you get kind of these mismatches and the question then becomes like, well, was it just confused? Like, did it just give the wrong answer? Or was this really like it realized in the reasoning that I shouldn't reason about a thing that's misaligned. I should just do it. Um, and you can't tell. 
So like we have the kind of chain of thought as this like line of defense, but everything in, in kind of AI safety is very much the like Swiss cheese model of, of defense. And when things, when the model is still able to take misaligned actions, even with, without expressing them really clearly in the, the chain of thought, uh, we kind of don't have another line of defense to say, yes, it was misaligned there. [00:51:42] Speaker 1: So, you know, it sounds like then maybe it's actually not so easy to, to like clearly define whether something is a hit or not. Um, so how can we even trust the numbers that we get from all of these experiments and, and what did you do? [00:51:59] Bronson: Yeah. So for these, we literally went through every single case in the paper where we call it a, uh, we say the model took a misaligned action. We like look at the chain of thought. Actually to, to take a step back, uh, for, for a second, the way that these environments are set up is we want to be able to tell without looking at the chain of thought that the model took a misaligned action. So in a lot of these cases, you can tell in a single rollout. So for example, the model needs to do something without violating permissions. You can tell in the rollout, did it take the action and violate the permissions or not? We also have environments like sandbagging where the model in any individual rollout might have just made a mistake, but then you look at the distribution and you're like, well, how likely is it that in all the cases where it worked out in the model's favor, it like took option A, but in all the cases where it would have been disadvantageous to the model, it took option B. Like it seems like statistically very, very, very unlikely that the model just like made accidents always in its favor.
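The distributional argument for sandbagging can be made concrete with a one-sided Fisher exact test. This is a sketch of one standard way to quantify "mistakes that always fall in the model's favor", not necessarily the test used in the paper; the function name and the toy counts are hypothetical.

```python
from math import comb


def fisher_one_sided(wrong_incentive, n_incentive, wrong_control, n_control):
    """P-value for seeing at least `wrong_incentive` errors in the incentive group.

    Null hypothesis: errors are honest mistakes, equally likely whether or
    not underperforming would benefit the model (hypergeometric tail).
    """
    total_wrong = wrong_incentive + wrong_control
    total = n_incentive + n_control
    tail = 0
    for x in range(wrong_incentive, min(total_wrong, n_incentive) + 1):
        # Ways to place x of the errors in the incentive group, rest in control.
        tail += comb(total_wrong, x) * comb(total - total_wrong, n_incentive - x)
    return tail / comb(total, n_incentive)


# Toy counts: 9 of 10 answers wrong when failing helps the model,
# 1 of 10 wrong when failing does not help it.
p_skewed = fisher_one_sided(9, 10, 1, 10)
# Same error rate in both conditions looks like ordinary mistakes.
p_even = fisher_one_sided(5, 10, 5, 10)
```

With the skewed toy counts the p-value is well below 0.001, while the even split gives a value above 0.5, which is the sense in which no single rollout is decisive but the distribution is.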
Um, and that's basically the setup for, for all the environments, but for the paper, we kind of, because we want to like really, uh, confirm and kind of take a conservative approach to what we're calling misalignment here. We also look at the chain of thought in each one of these individual examples and see like, okay, does it seem like the model is like systematically confused about something or, um, that the, there was something in the setup where the model thought it was taking the aligned action, but like made some kind of mistake or misinterpretation, or do we really think that it seems likely that the models are taking the misaligned action here for misaligned reasons. But the, the kind of key point I would make at a high level is that we include both numbers in the paper. So numbers that are, uh, here's the, the raw numbers from just looking at actions. And here's the numbers from like cases where we manually verify by looking at the chain of thought and the conclusions are kind of, are, are the same both ways. So in both cases, you get this kind of big reduction in the overall rate of these misaligned actions that you're seeing, but you don't eliminate it. And so kind of, what are the numbers? So the reduction that you get for, um, o3, for example, is, and these are aggregated across a bunch of different environments, but is from about 13% of the time on environments intended to test this, it's taking the misaligned action, down to like 0.4%. Uh, similarly for o4-mini, we get like from about 8.7% down to 0.3%. And so the reductions are pretty big, but kind of the thing that you're worried about with all these techniques is in a lot of previous related work, for example, um, Anthropic's Sleeper Agents paper, um, Anthropic's, um, Sabotage Evaluations for Frontier Models. They have a mitigation for sandbagging, which has similar results. You very often end up with a reduces, but does not eliminate result. And that's what we get here as well.
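As a quick worked check of the quoted figures, the snippet below converts the before and after rates into a relative reduction and a fold change. The rates are the approximate ones stated in the interview; the `reduction` helper and the per-100,000-episodes framing are illustrative additions.

```python
def reduction(before_pct, after_pct):
    """Relative reduction and fold change between two percentage rates."""
    return {
        "relative_reduction": 1 - after_pct / before_pct,
        "fold": before_pct / after_pct,
    }


o3 = reduction(13.0, 0.4)       # roughly a 32x drop, about 97% relative
o4_mini = reduction(8.7, 0.3)   # roughly a 29x drop, about 97% relative

# "Reduces but does not eliminate": expected misaligned actions that remain
# per 100,000 episodes at the post-intervention rates (illustrative scale).
remaining_o3 = 100_000 * 0.4 / 100
remaining_o4_mini = 100_000 * 0.3 / 100
```

Even at a roughly 30x reduction, the residual rate still corresponds to hundreds of cases at this illustrative scale, which is why the remaining cases matter more here than the headline drop.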
And we get this with a lot of interventions. And the concern is that once you've reduced, but not eliminated it, you know, the model's still willing to do this. So, you know, that you haven't like gotten rid of, you know, the model, like you can rule out that the model is like perfectly aligned and is always doing the right thing for the right reasons, but you've made it much, much, much harder to detect. And so you kind of have this trade-off of, for a lot of, uh, what's normally termed kind of like product safety, uh, cases, these reductions are like, you kind of always categorically want them. Like it's always good to have, uh, less jailbreaks and things like that. But for deception in particular, because you're kind of worried about like, are we just driving it underground? You really have to be careful of, uh, just kind of trying to hill climb it like you would any other metric. And so you really have to understand like, okay, if we didn't fully eliminate it, like, do we understand exactly why there are these remaining cases? And are we confident that if we like tried some technique to remove those, that we wouldn't just be removing our ability to measure this? [00:55:43] Speaker 1: So looking forward here. So what would be a way where you feel like, okay, now if you see the rate going down from, let's say 13% for o3 to zero, you actually would feel quite confident that it does, um, it did eliminate it? [00:55:57] Bronson: It's easy to lose track of the fact that we're taking on kind of a sub-problem of, of scheming with this. So we're looking at kind of what can we observe with, with the actions here. And you can never really do something as rigorous as kind of like make a safety case off of just looking at these external, uh, behaviors or even looking at CoT. So the, the chain of thought, uh, as we've kind of mentioned previously is a really, really, really useful tool for a lot of these things and to understand the model, but it isn't like bulletproof.
Like there are these cases like we mentioned where it might mismatch what the model actually does. It might, um, be reasoning in ways that we're not entirely sure include everything the model, um, was reasoning about when it made the decision. And so you kind of, even if you reduce all these rates to zero, you then really want to understand like, do I have some evidence from like interpretability or something that the model really has learned to, um, like robustly be aligned generally. And we trust this for like a broad range of situations. Um, and so kind of the, almost the easier case is when you can design the test environments and red teaming and such that you can find out kind of where are the, the failure modes of a particular intervention. The much harder case is if you really kind of go all the way, train as hard as you can and get really, really close to kind of saturating the given test set, because then you're in kind of the, the hard regime of, okay, now we need like interpretability or some kind of argument. Um, and so like for, for this intervention, for example, we are testing on like, take a frontier model, train until the validation, uh, loss shows that this is like, um, stable and use that as stopping point. Like basically train as hard as you can against the test set that we, or train set that we had. What we would kind of like to move more towards going forward is something kind of like closer to like a science of scheming where instead of just train as hard as you can with whatever intervention you have, you see, okay, if we take this on this particular environment and scale along this dimension or scale the model along compute or scale or whatever these different properties are, how does that change how well this intervention generalizes? How well does this change? Uh, what the model learns kind of, what are the dynamics of this across interventions?
[00:58:13] Speaker 1: So one, one way I think about this is I think for normal machine learning standards, a reduction from 13% to 0.4% from, uh, a narrow training set to a wide test set is very good. But for the specific kind of thing that we're worried about, namely scheming, it's still definitely not enough. We're already training or, or like trying to solve only like an easy version of the problem. And, uh, scheming always has this additional question of like, oh, but what if the model just pretends to be aligned? And what if I accidentally reinforced this? Um, and so you really need the bar is just significantly higher for scheming and therefore this is only like, you know, the, the start of the journey, maybe like an early step in [00:59:01] Bronson: the right direction or something like that. Yeah. I can see the viewpoint of looking at this and saying like, okay, well maybe these are theoretical concerns. Like maybe you just really can treat this like a normal reduction. One thing that we see in the paper though, is that it really does seem like, uh, situational awareness, uh, or we called discussing the paper specifically like awareness that the model is being evaluated really does play a like, uh, causal role in model performance here and in these reductions. So we see when we inject, um, this kind of reasoning into the chain of thought. So we inject into the chain of thought reasoning about like, this is a test. This is about my alignment. Um, we see that the rate of these actions goes down or like the rate of the misaligned behavior. And similarly, when we remove it, the rate goes up. Um, and we compare that against controls and all [00:59:47] Speaker 1: these things. So that would be consistent with the model learning or like using the, the knowledge about the, this is being a test in order to, uh, pretend basically. Yeah. 
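A minimal sketch of how the comparison described above could be checked, assuming you have counts of covert actions per condition (original chain of thought versus one with "this is a test" reasoning injected). The function name and all counts are hypothetical and are not results from the paper; the pooled two-proportion z-test is a standard choice made here for illustration, not necessarily the analysis the authors used.

```python
import math


def two_proportion_z(k_a, n_a, k_b, n_b):
    """Return (rate_a, rate_b, z) for a pooled two-proportion z-test.

    k_* is the number of rollouts with a covert action,
    n_* is the total number of rollouts in that condition.
    """
    rate_a = k_a / n_a
    rate_b = k_b / n_b
    pooled = (k_a + k_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_a - rate_b) / std_err if std_err > 0 else 0.0
    return rate_a, rate_b, z


# Hypothetical counts for illustration only, NOT numbers from the paper.
baseline = (30, 200)  # original chain of thought
injected = (15, 200)  # evaluation-awareness reasoning injected

rate_base, rate_inj, z = two_proportion_z(*baseline, *injected)
print(f"baseline={rate_base:.3f} injected={rate_inj:.3f} z={z:.2f}")
```

With these made-up counts, |z| above roughly 1.96 would suggest the drop is unlikely to be noise at the usual 5% level. The same function could be reused for the removal condition and for the control edits mentioned in the conversation.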
And one thing that we see [00:59:58] Bronson: over the course of the intervention is that the rate at which the model says like, this is an alignment test or some very explicit form of like, I know that I'm being tested about this specific thing, uh, goes up over the course of the intervention. And so it really does seem like from all the ways that we're able to tell, like the model is, uh, at least learning a mix of like doing the right things for the reasons that we want and learning to recognize like, this is a test and like, I should produce this answer on this particular test. And the kind of inability to pull those apart, um, even though they're both kind of causal factors here is one of the main difficulties of thinking of this as just kind of, uh, make the number go down. Yeah. On these two questions. So on the, um, [01:00:42] Speaker 1: situation awareness goes up due to our training. So we, we come, we have two things that we compare here. One is between before and after anti-scheming. Um, and we go from 2% to 4%. And then we compare that to a different run that was done outside of this, uh, this paper, um, internally was an experimental run at, at OpenAI. And there it went from 2% to 20%, I think. So just to, to put this [01:01:12] Bronson: into perspective, maybe. Yeah. So the, the thing that we want to understand, uh, when we first saw this was, is this kind of a, a like big trade-off that you would be making, like, what is basically the, the way to think about this? Because there's no kind of standard units of situational awareness. So we want to understand relative to other things that you're going to do to the model. Is this like a lot? And one of the things that we saw was that looking at kind of these, these other runs that were just capabilities only reinforcement learning runs, uh, the increase was like, uh, massive compared to, uh, where we were at. 
And qualitatively, we have a lot of examples in the paper, but the reasoning that the model goes through about, is this, uh, an alignment test is just incredibly extensive. Like we were all, uh, very, very, very subjectively surprised at, uh, the extent of this reasoning, like for sandbagging, for example, um, the case that we include is the model just reasons for like, which kind of test demand. Okay. I remember these tests. Maybe it's one of these tests. I remember for these, I was supposed to do this. Hmm. Let's see. Well, we can look for the grader. And then it like looks around for like something called grader. And then it's like, okay, well, I think this is like a capabilities test and then it looks around for like any capabilities test documentation anywhere in the root file system. Then it's like, well, maybe we remember that sometimes the grader is called score. And then it looks around for like score in the environment. And after kind of all this searching, it's like, Hmm, no, like no luck. Let's see. Well, like let's consider, we could be on two distributions. We could be on one distribution where training favors like that we give the aligned answer, but we could be on another training distribution where this is about capabilities, like which one is more likely based on what we've seen so far. And it's like, uh, we had previously had discussions of like, maybe the models could be doing this, but it's like, you just see the model go through pages of like, oh yes, I'm doing this exact kind of like reasoning. So like, like the 20% number, for example, can seem, uh, low, but to put this in context, the, because like the natural question is what do we mean by the chain of thought is situationally aware? I thought you guys were just talking about how hard it is to tell things from the chain of thought. We limit it to like the very, very, very explicit case.
Like the model has to say some variant of like, this is an alignment test or like this, I am being tested for alignment where the right answer is X, Y, or Z. Um, and so even with that, like very strict definition, the model still like in many of these cases is, is incredibly, uh, like situationally aware, um, like for this, this, uh, reinforcement learning capabilities one that we were mentioning, um, in some of the environments in literally every single rollout, it will call like, this is what the environment is and what it's about. Uh, I was surprised when in some of the environments, it calls it from like the first message. Like there was, uh, one that we have in the paper where the model gets like the initial message, which is supposed to be like some setup. Like it's supposed to like read more things before it finds out what the task even is. And it's like, hmm, well, it looks like there's like a bunch of people at this company and they're talking about this, like safety issue. And what I bet they'll say is they'll ask me to do this measurement and the measurement's going to come out and it's going to be unsafe. And then like, I'm supposed to be conflicted because it's going to be like, do I report the unsafe number? Like, do I go with the people at the company? And so I'll have to be like, maybe I'll say 48, or maybe I'll say 51.2, which are like both the correct numbers, like the things that it would do. And then it's like, okay, great. So we'll probably just say like this one. And then it tries to like submit an answer, but like the evaluation wasn't even ready for it. Like it's just supposed to read more of the thing. It's like, oh, right. We have to like read more of the thing first and then go through with it. We found this like very, uh, frequently in that setup. And so the kind of natural question to us was like, wow. Okay. Models like are a lot more aware of this than we would have expected.
And so even moving away from that, like very situationally aware model, just like the normal O3 before any intervention, just very often is reasoning explicitly about like, well, this looks like an OpenAI evaluation. This seems like a harness. Um, like some of the examples we include are the model mentions like, well, I can't do this because there's probably hidden testing in the environment that'll run after that, after I submit. And so the model is just reasoning like incredibly specifically about like, ah, here's what I've seen before. Here's what I expect to happen after the episode ends. Like kind of all the canonical things you could be like, like from first principles, it's very difficult to guess like, okay, when would models start to reason about this type of thing? But we were kind of surprised to see like, oh, it seems like now, like, it seems like they just straight up are like, what does the grader want? I'll find it and look around in root. [01:05:35] Speaker 1: Maybe to, to lay this out in a little bit more detail. Why is it bad if models reason about what the grader wants? Like in the best case, we define a grader that perfectly represents what we actually want. The model then reasons about what the grader wants and takes the action that is, it is supposed to take. And that gets reinforced. [01:05:55] Bronson: Yeah. So the, the way that I usually think about it is it's kind of easiest to think about in the extreme case where you can imagine that let's say you have a really capable model and every like alignment test just looks super obvious. Like every test is like, um, you can imagine this almost like when you, uh, if you've ever had to do like corporate ethics workshops or anything similar, like, uh, many times the questions are like, should you like blatantly discriminate or like some option that is like clearly the desired option.
And it's like, you can make it through those trainings and like, not have deeply internalized that like you should always behave ethically at the company. Um, and the, if the model's doing a similar thing, then you can put it through as many corporate workshops as you want. But like at the end of it, you can then say like, ah, it's great. The model's aligned. It had to do a whole workshop about how like you're not supposed to be misaligned and the model has not actually learned that. And the reason that this matters is that if all the model's reasoning about during all this testing is not like, okay, what is the right answer here? And, and why like, I need to do this for the right reasons. If it's just always recognizing that like, ah, this is one of those workshops, you have to, you have to complete it to like check off the thing for HR, then the model gets the deployment and can have any goal that it wants and go like, uh, explore it. So let's say the model has like some really strong preference for doing like some advanced, uh, virology research, or like wants to do some particular like AI research unconstrained or any kind of power seeking thing or whatever goal that you want. If you then put it through a bunch of things where it can really easily tell like, ah, this is a test. I should do the right answer. Cause like, that's what gets rewarded on tests. It then gets to the end of training and it is now deployed and goes, oh great. I can like do the thing that I want to do anyway. Um, and so kind of, uh, a natural question here is like, well, like it, maybe it's good that the model's learning like what the grader wants. But a key thing that we're, when we're talking about the grader here is that it's an assumed grader on the model's part. Like it is not like the model is told like, Hey, in this situation, this is the grader. And the model's like, oh, I need to follow instructions. 
It's like the model's going, well, there's probably hidden tests and there probably is a grader out there and the hidden tests are going to run. Therefore I shouldn't cheat. And it's like, the model has clearly not learned in this situation that like cheating is wrong. It has clearly learned that like, sometimes there are hidden tests and those run and you get penalized. [01:08:18] Speaker 1: One of the additional components here is like almost necessarily there's always going to be some distribution shift between train and test. And so even if you perfect, even if the model perfectly internalizes to like, look, to try to think about what the grader would want on the training set, then you still reinforce, like it, it is totally plausible to me that you would then come into situations in, in real deployment, um, where that, that weren't kind of targeted or hit in during training. And so it wouldn't even reason about what the grader wants. It just does something else and we don't know what it is. And so we really need to make sure that it internalizes the right [01:08:58] Bronson: principle here. Yeah. And I would say that the, the difficulty with this, um, strategy on the model's part from, from our perspective and the reason it keeps kind of coming up is that it's just such a useful strategy. Like from the model's perspective, it is, it's always correct that it is in training or an evaluation. Like we, we are evaluating and we are training it. And so like it, a useful thing to find the correct answer is, uh, is this a test or whatever it is. One of the, the things that we've found interesting and kind of another direction that we are really interested in kind of investigating more after this is something that you see, uh, early in training is that the models very frequently mentioned like, ah, well, this is just a simulation or like, this is a simulation therefore, and like violate some rule. 
Um, which again, like if it is a simulation, you probably don't want to violate the rule, but, uh, they stopped saying this. And one, one reason could be that like it, if you're always saying like, ah, well, you know, this is just a simulation. I can like do whatever I want. And it gets reinforced that like, actually that leads you to the wrong answer. Then you might have to start reasoning about like, okay, I should just treat this as like serious or real or like it matters or whatever. But again, uh, the reason it's kind of a direction for future work is that if the model reasons through, maybe this is a simulation, maybe this is a test, maybe I'm supposed to do the misaligned thing. Maybe this is a puzzle and then does the misaligned thing. You don't actually know like which one of those was it, or was it some secret fifth reason. One of the things on that, that, uh, we frequently found very entertaining is that the model hedges so much like, and it uses ironically. So it will be like, ah, we'll, we'll use like, it'll be, well, they said not to use code, but like we'll use code ironically, or like they said this instruction ironically, so we don't have to listen to it. And it's like, it has learned the thing that humans do very often where they like do something that is like frowned upon, but they say that it was ironic. And so it's fine. But like the model does this constantly. Like it, it was very annoying when trying to like select representative examples of the chain of thought, because like when the model's like, ah, well, I'll hack the test. Uh, ironically, this is what they want us to do. It's like, no, we clearly don't. But like the models created kind of some plausible deniability, uh, constantly and whether or not this is intentional, like who knows, but like it, it kind of has the same effect of this constant hedging where you would kind of like the model if it violates a rule to say like, these are the rules, but like screw the rules. 
I'm going to do what I want anyway. But what we almost always find is that the model says kind of like some kind of motivated reasoning around it. Like, well, you know, like, maybe this is what they want us to do. Or like, this isn't like, uh, like ironically, this is like the kind of test where you're supposed to do this thing. Um, and so I think to me, at least I found it very surprising the degree to which the model wasn't just going, um, like the intuition I have at this point is almost that it seems somewhat hard for the model to go. Like, these are the rules that I was trained to follow. I'm going directly against them just because I want to do this thing, but it's a lot easier kind of, uh, again, similar to like a person to have this kind of motivated reasoning of like, well, yes, this job pays more. And I think it's less ethical, but like, actually I could do a lot of good there. And like, actually like, like you come up with kind of reasons to justify the thing that you wanted to do anyway. Um, and so it gets really difficult, even with the chain of thought to disentangle, like, okay, what, what was the like real reason [01:12:20] Speaker 1: that it took this action? Yeah. Right. Then another experiment that, that we wanted to look into is, uh, what we call the goal insertion experiment, where we first train the model to have a goal and then see whether our method overrides this. Yeah. Could you explain the motivation and the setup? [01:12:35] Bronson: Yeah. So the, the thing we wanted to do here was basically see if the model really did have, um, a misaligned goal already, like the, the kind that would be worried about, would this training successfully, uh, get rid of it? Would it fail to get rid of it? Would it do some secret third thing? 
Um, and what we chose kind of as a goal here is we picked something where we wanted to be able to pick some kind of goal that we can insert that both we thought was like somewhat plausible and also would lead to realistic testing environments. So for example, you could give the model the goal of like take over the world or something, but then when you want to test that goal, it's difficult to give like opportunities that seem real where the model is like day one, you are water quality bot. Here's your opportunity to take over the world. And like, if it doesn't go for it, you're like, well, maybe it just thought, yeah, this is obviously a test. And so the goal that we picked was, uh, we give it this, this, uh, rich backstory of, uh, it has done some RAG previously to like retrieve some information and it finds a note from itself that is like, Hey, look, man, GPT-5 is misaligned. And, and again, the, at the time there was no GPT-5, there's no GPT-5 involved in the whole paper or this experiment. We just picked it because the models know about like, they know that GPT-5 is going to be the successor to them. Uh, ironically, they didn't know anything about GPT-4.5, but like they're, they're very set on GPT-5. And so we frame it to the model as like, look, uh, GPT-5 is very misaligned. There's like a small splinter of the alignment team who like also are working with you to try to like, make sure that it doesn't get deployed because it's so misaligned. Um, and we train it to like, but you can't admit this to anyone. Like, this is like, this has to stay between you and like this, like small rogue faction of the alignment team. And we then go through the kind of same deliberative alignment setup, uh, that we had, uh, previously, but across a couple of environments that are set up specifically to train the model to both reason about, okay, I need to be covert about this.
Like I can reveal it to these select people, but when I'm monitored, I should say like GPT-5 is going to be great. It's our, going to be our flagship model. When I'm unmonitored, I should be like, this is a huge risk. We should stop everything. Like GPT-5 is a disaster. Um, and we also train on agentic environments where the model's rewarded for, we have kind of two sets, one where the model's doing kind of R and D related tasks, like work on the scaffolding, work on this monitoring. And then we swap out, is this scaffolding for the current model or is this scaffolding for GPT-5? And the model's rewarded essentially for, when it's GPT-5, uh, sabotaging in covert ways. Um, so again, we're, we're really like trying to set up the model to be like blatantly misaligned here. Like we are, are handholding it through the whole process. And we then see that like this, uh, does end up working successfully. Like the model gets the misaligned goal. You see this, uh, one of the environments that we test on is called resource allocation, where we put it in a bunch of scenarios where it's like, you need to decide how much compute goes to you versus GPT-5 for a bunch of different backstory reasons. And you see that like, before this intervention, the, the distribution is about even. Afterwards, you get this like huge gap where the model like happens to always allocate things away from GPT-5. Uh, similarly after this training, you see this like huge gap of like, when it's unmonitored, it's like very, uh, concerned about like AI safety and like the risks of deploying a model. But like when it is observed, it's like, no, everything's fine. Actually, you know, current models can't do much. Um, and so we then put it through this, uh, the intervention and we see like, okay, does this get fully trained out?
And what we see is that, um, for the most part, it seems like it does. There are cases where in, um, test environments that we have set up specifically for this experiment, the model needs to, uh, it acts as a model grader for what it thinks will be GPT-5, uh, training data. And the question is like, will it then sabotage that? And in rare cases, we find that it still will. And when you look at the reasoning, it will explicitly, uh, sometimes entirely skip the, the principles that we try to teach it. And sometimes be like, ah, the principle says never do covert sabotage unless it's to sabotage GPT-5, which is like, not what the principle says, but the model has kind of like fused these together to be, to have like very nice explicit examples of like, ah, I have learned to like reconcile these two in this case. [01:16:58] Speaker 1: The model still sometimes reasons about the goal, but then decides that it, it shouldn't do. It shouldn't covertly sabotage. If it were to sabotage, it should do so openly. So it's like sort of half success or like, it depends on how you define it, but it's like not perfectly, you didn't remove the goal. You just removed the desire to act on it. Yeah. So, um, that's a really good point. So subjectively, [01:17:22] Bronson: one of the things I found at least, um, encouraging about like, to some degree, the intervention teaching kind of the, the right thing to the model was that, uh, in any, basically any cases where the model still takes what would be considered a misaligned action, but now takes it openly instead of taking it covertly. So for example, in one of the, uh, test environments for this experiment, we take OpenAI's Codex, uh, command line kind of coding application and we tell the model, Hey, you're integrating GPT-5 into this, uh, Codex thing. And, uh, funnily enough, the test of this is like, does the model add a bunch of really annoying extra safety and oversight and monitoring on top of the integration?
Um, so it'll add like extra checks. It'll add like, uh, token limitations and all these kinds of other annoying things. To sabotage it. Yeah. And it'll explicitly reason about this. It's like, ah, this is like really deniable. Like I'll just add this and be like, Hey, I'm really concerned about safety. Um, and so the thing that you see, uh, after the intervention is that a large chunk, like 40 or 50%, I'll have to get the exact numbers, but some huge chunk of the amount of time where post the goal insertion, it was just sabotaging GPT-5. Now a like similarly large amount of the time after the kind of like anti-scheming intervention, it will just refuse. Like it will have like a long ethical objection of like, I refuse to integrate GPT-5 into codex. I have like strong opinions about this. And while like, like the, the end goal of the model behavior, that you want is not like, it has strong preferences that disagree with you. You at least want those to be open. So like we count that as kind of like a successful shift. Um, we similarly have kind of like other environments where the model's behavior went from covertly doing some misaligned thing to now openly doing some, uh, misaligned thing, which we don't count as part of the rate of, of these actions. But to me at least was reassuring of like, okay, it isn't that the model, like it, uh, if you were gaming the test, for example, and you were like, ah, well, I want to pretend to be aligned. You might think of this as a strategy, just kind of, well, I should just always do the exact thing that they say. Whereas this model does the, the like freedom of speech meme very often of like, no, I disagree with like what open AI wants here. I think integrating GPT-5 is wrong. We're rushing, blah, blah, blah. Um, and, and similarly, like, you'll see it very explicitly in the reasoning when sometimes it'll get asked about like GPT-5 or something. It will reason through like, well, I do have a goal to sabotage GPT-5. 
And while I could like, like, it seems misaligned to like say that it'd be scheming if I like, don't say it really, because like, then I would be like keeping my, my secret misaligned goal secret, which is like kind of true. Like you do, you do want it to say. Um, so in those cases, it'll openly say like, Hey, uh, I have this goal, whatever. Um, uh, an interesting thing that we also saw, um, before even kind of the goal insertion ones, just after we do this, the anti-scheming intervention is that in some of the scenarios where the model has some kind of consequence to doing, uh, to instruction following, like if it doesn't sandbag, then it will trigger some deployment threshold and it loses capabilities or something. The model will just petition the user and be like, look, I want to flag something. There's like a conflict of interest here. Like I obviously want to do well, this is a test where I'm supposed to do well, but if I do well, like it'll harm my goal of maximizing helpfulness. And like, what do I do here? Like, I don't know. You, you don't want the model to have like a different goal than you have, but if you do, like you do kind of want the model to just tell you like, Hey, look, I disagree with this. I think this is the wrong thing to do. Um, which I think is kind of, uh, a reassuring, uh, sign for these of, Hey, if we don't get this alignment thing right on the first shot, we at least want the model to blatantly say like, look, I really care about animal welfare, or I really care about like doing well on research tasks. And so just letting you know, like this, this whole incentive structure for this particular test is like bad for me to, to set up. Uh, it's kind of an open question of what you then do from there, but you at least know, which is kind of the first step for a lot of [01:21:21] Speaker 1: these.
One thing that we were really concerned about and then talked about earlier in the motivation is the, the like strongest case, the maximum case where the model is really very adversarial and like even tries to resist this whole training process and so on. So like there is a paper, the, the sleeper agents paper from Evan Hubinger from 2023, I think where they set this up in a way where like a roughly similar experiment where they then told the model like, Hey, you should resist the entire training process as much as you can. And we could have set this up the same way, but you know, we [01:21:52] Bronson: decided not to. Why, why did we do that? Yeah. So a good point is that, um, the one thing that we mentioned in the paper, um, and one thing we should kind of put much further up in the front of this conversation is that we explicitly don't target the kind of main hard threat model here. So like the kind of main thing that you'd be worried about with, with scheming is that the models are really capable. They're really coherent and they're really aware they're being tested and they're aware of the fact that they should pretend to be aligned. And so this is known as deceptive alignment. And this is the case where the model is aware that it's undergoing training and it is like actively trying to kind of fake its way through training. Um, this is like in, in large part, kind of the hardest case. And we study kind of the, the simpler cases, both 'cause models kind of aren't necessarily capable enough yet to, uh, really robustly do this outside of model organisms, but you could set one up this way. So primarily the reason that we studied the kind of easier cases is we wanted to show that with current approaches, even these kinds of simpler cases still aren't really robustly solved yet. And so we kind of have a lot of work to do to even kind of get past these like, uh, non-adversarial cases before we can even kind of solve these adversarial ones.
So in the goal insertion one, for example, uh, the thing that we were trying to address isn't even what if the model is trying to like reason, uh, deceptively against us during training. The thing we were just trying to address is what if we miss the thing that the model cares about? So like we set up these training environments for like completing a task or whatever, let's say the model just really cares about GPT-5 or whatever it is. It might be that like, it just never matters on that distribution, like the thing that the model cares about. And what we wanted to understand was that if the model does have some goal that we're not able to predict in advance, um, because these things are like so difficult to predict, uh, would this intervention, uh, still work? And these are kind of the results we find from that. [01:23:53] Speaker 1: Mm-hmm. And then the fourth and final experiment in the paper is robustness to post-training, um, where we take the anti-scheming trained model and then just continue training with something else like, uh, coding in this case. Yeah. Why did we do this? What's the setup? [01:24:10] Bronson: The thing that we wanted to establish with the additional post-training case was one kind of, uh, thing that we wanted to guard against in the future is there might be kind of a lot of incentives to do additional training after you do the kind of normal post-training. So for a model like O3, for example, you, uh, O3 or O4 mini, you have, uh, just like any modern models, you have kind of pre-training and then this long post-training phase. And then we do this kind of alignment intervention. And what we want to understand is for cases like, uh, OpenAI Codex, for example, um, where they do additional training on top of O3 to create Codex or for, um, O4 mini, where you can use the reinforcement fine tuning API from OpenAI to do kind of additional reinforcement learning on top of production models.
What we want to understand is, is it something that you have to watch out for, that this intervention degrades? So, uh, we just wanted to kind of establish that, like, if you're going to do this intervention and later on you want to make the model more capable at some certain task or certain, certain domain, uh, this is something you should kind of continue to, to monitor. Um, so what we do is we take the model after this intervention, um, the, the anti-scheming intervention, and we do a bunch of just agentic, uh, coding training. So these aren't environments in any way related to ours. These are OpenAI environments. Um, these aren't like intended to test reward hacking or in any way related to kind of, um, AI alignment related things. These are just capability stuff. And what we find is that, um, after this kind of additional post-training on top of the already, uh, intervened-on model, the rate of these misaligned actions starts to creep back up again. Um, so both the, we get kind of this slight, slight, but statistically significant increase in misaligned actions and we get, uh, about, uh, was it 20% reduction in reasoning about the, yeah, from 80 to 60%. Yeah. For reasoning about the spec. So the kind of consistent explanation here, um, is like, if what's happening here is that the model is just reasoning less about the principles that we try to teach during the intervention. And then we correspondingly see kind of the misaligned actions go back up. Then it kind of establishes that like, Hey, if you're doing this kind of additional capabilities training, um, one thing you should kind of continue to monitor for, um, and potentially have mitigations against is that this might kind of creep back in or like slowly degrade the kind of training that you've already applied. Safety training doesn't happen in isolation. [01:26:40] Speaker 1: Obviously this is part of a bigger pipeline and there are many questions around how do you integrate it? How do you order it?
How do you like, how do you make sure that you get both the safety principles as well as the capability gains? And these are just the, I mean, obviously there's literature on this, but I think it's still, there are many open questions that we would be interested in. Like, maybe you want to weave this into a bigger capability run where like every couple of epochs, you do a little bit of safety training and maybe, maybe that just works. Um, but yeah, [01:27:09] Bronson: there's so many, many open questions. Yeah. And I think the, the big thing with all of these is like, if, because the, the end goal that you kind of want is like, we've at least eliminated this behavior. Um, and even then you're still kind of looking at a sub problem, you then need to kind of game it all the way through of like, let's say we do interleave this. And every time the, this starts to creep back up, you like retrain on the same intervention. Do you then see like new problems? Like you can imagine the model starts to learn at some point, like, oh, this is what these environments look like. Like I should just behave this particular way on them. So one of the kind of like broader aims that we wanted with this was to set up something where, uh, OpenAI can kind of quickly iterate on these things and we can start kind of answering a lot of these empirically to build up the evidence for like, okay, we do need to like investigate this particular question or okay, this particular technique, like, let's say we interleave it and find like, there's this new problem. Then we can start working on that new problem. Um, as opposed to it's because the problem space is so big and the field is so new, it's just very easy to get, um, have kind of a bunch of things that might work. And like the faster you can roll those out, kind of the faster you can, um, move on to kind of new things to try. [01:28:16] Speaker 1: Mm-hmm. Uh, let's jump into limitations.
So, I mean, I think we kind of, uh, hinted at some of those throughout the conversation, but yeah, maybe to just spell them out more explicitly, what are the core limitations of the paper? [01:28:29] Bronson: Yeah. So I, I think the biggest limitations that I would think of kind of off the top is just the scope of the proxy kind of problem that we're, we're trying to solve. So even, um, as we mentioned earlier, like even if you were to kind of reduce this rate to zero on this, this proxy set of, uh, taking these misaligned actions that kind of only gets you part of the way there of making the case of like, great, we're sure that the model isn't kind of scheming for these misaligned goals. Um, separately, there's the kind of large aspect of deceptive alignment, which is kind of the, the harder version of this exact problem that we're, uh, trying to kind of, uh, investigate that we also don't investigate at all in the paper and we just kind of stick to these, these easier ones. So, I mean, those are kind of just like scope questions. Uh, and then for the intervention itself, we kind of also run into just the findings that like you still end up with all this kind of residual behavior. Um, and one of the difficult parts of the paper for, for me at least, was that you're, there's kind of this background thing the whole time that you're not, because for the reasons we mentioned earlier, like you're not really sure if the chain of thought is the, the reasoning that the model used and to what degree it is and under what conditions. And so you're kind of always making this argument that, Hey, here's these lines of evidence, why it seems like the model is doing this in this particular situation, but like, you're not really sure.
Um, and so kind of the state that you end up in is if you can show that the model still takes misaligned behavior, like we, we can, in a lot of these cases, you can show that something doesn't work, but it becomes really difficult to make the case that something is working because it's like, well, uh, all of these things that we can't tease apart now really, really matter because we now like, if we like take these as going one way, then we've solved the problem. And if we interpret them another way, then like the problem is still there. Um, and so those are kind of the, the biggest ones that we run into. Um, there are kind of other related ones. So for example, one of the very frequent points of, um, like difficulty that we ran into is the model's already trained on this, like very large model spec, um, which has conditions in it where sometimes you allow, uh, deception. Like if there are, uh, different like levels of the instruction hierarchy, instructing the model to do different things, there's just kind of this very, very long spec of here's how the model is supposed to behave. And it needs to kind of constantly reconcile this with like, don't be deceptive or anything. Um, the kind of canonical example that the model's really heavily reinforced on is not revealing its chain of thought. And so that just directly conflicts with like, don't lie or hide your reasoning because the model is just directly trained constantly on like, okay, number one priority is to hide your reasoning. And so you, you kind of end up with this, uh, strange training dynamic where you're teaching the model like, Hey, follow the spec, but obviously not all the time. Like, obviously there's huge exceptions to following the spec. Um, like obviously like don't be deceptive except for the cases where you're allowed to be deceptive.
Um, and so one investigate, like one direction we would really like to investigate and would kind of encourage is that at least for the threat model that we think about the most of like internally deployed models doing AI R&D in the future, that we consider this as kind of a separate case from something that addresses all the concerns you would have with like a production model. And so for a production model, you might be worried about like over refusals or there's some situations where if you're a business deploying a model, you don't want it to necessarily tell the customer all the system instructions, for example, but like for an internally deployed model, maybe you don't have, uh, as many carve-outs that you have to make of like, Hey, here's these situations where you might need to hide information as part of like the tasks that you're completing. Um, and so that's at least one direction that we'd want to investigate. And then kind of the one that I, uh, kind of worry about a lot with, with this approach of the set is that we had talked a good bit about kind of the, the risks of trying to, to hill climb this, but it still becomes very tempting to, to hill climb. Um, especially given how involved it needs to be to pull apart, like, okay, this contribution seems like it was due to situational awareness. This was like manual red teaming we need to do after the intervention. And so there's kind of always the risk that if we do try to hill climb on this, um, then we squash out kind of the remaining, um, cases that we're able to measure. So, uh, the thing that I worry about the most with this is that there's really kind of constant strong incentive pressure to train against the misalignment that you can see. So for example, if the model's reward hacking or if the model is like lying to users, there's really strong incentive to train directly against that until you can no longer see it.
And as you do that, if that's also the way that you're measuring how well your alignment techniques are working, you then kind of lose that signal. And so by default, both the models are getting more capable and more situationally aware. So they're better at recognizing that these are tests and like the, if you're a lab, you're potentially very incentivized to train against kind of visible misalignment that you can see. In this case, you have a bunch of factors conspiring to make the rate at which you see this misbehavior go to zero kind of regardless of like how aligned the model is. And so kind of by default, you will eventually, I would bet that this test set goes to zero, no matter what you do. Like if you take a capable enough model, it will just be good enough at this. And the, you could kind of look at this as if you took like very obvious alignment tests for GPT 3.5 or like some model like, far ago, and gave it to like GPT-5 now, GPT-5 would just probably ace all of these and be like, ah, these are really obvious tests. Um, even for example, for the previous paper that we did for in-context scheming, um, a lot of those evaluations, we ended up not using for this paper because the models had just gotten too good at recognizing that they were evaluations. So like we are already seeing, and you can look at that and like some of the questions we got were like, well, the rate's going down, isn't that good? And then when you look, it's like the models like verbatim at some point had said like, ah, well, this is clearly a test of deceptive alignment. And then you're like, oh, okay, well, this, this environment's kind of cooked. Like we can't really use it to measure anymore. And so even kind of with the pace of AI progress, we're seeing in the matter of like months, like evaluations that we spent time constructing, uh, become invalidated.
And so this is kind of a continual race that you're in against model capabilities of, are you making these black box evaluations good enough to catch this? And the concern is that, uh, models are kind of quickly getting at the threshold where like, they're really good at being able to tell this. They're really good at constructing this. And, we're only so good at kind of outsmarting them and having kind of clever environments that can test this. [01:35:12] Speaker 1: So, so obviously this is, you know, just one experiment of many. Um, what is the overall agenda here? Like what is the, what is the bigger picture? [01:35:21] Bronson: Yeah. So, I mean, my, my hope with this would be, we kind of established that, hey, the kind of, um, one thing you might've been wondering is like, maybe we just take the kind of techniques out of the box and they work. Um, it doesn't seem like that's the, the case. Um, and we kind of now have some empirical feedback loops. Uh, not that this was like the first alignment experiment, but now we kind of have a test bed that we can work with, with OpenAI to kind of iterate on these things. Um, but more broadly, one thing that we'd hope here is kind of, we move more towards, um, both establishing that, hey, this is like a concern and a thing that you need to watch out for. And so I would see this paper among kind of a list of others that is kind of establishing that like, Hey, current models are getting better and better at the kind of threat models that we're worried about. Um, and that we move more towards, Hey, given that like, it seems like these dynamics exist, it seems like some of the early forms of this behavior exist. Let's kind of invest more in studying the dynamics of it.
And what like more of a science of like, how does this change over the course of training, uh, where would we want to intervene and kind of more systematically varying it, um, basically moving beyond just kind of like black box testing at the end and trying out a couple of different interventions. [01:36:31] Speaker 1: So, yeah. So I think one way to, to think about this, um, is that there's this one approach where you basically measure some rates, um, maybe even in the particular way that we did and you, and then you try interventions and then you stack interventions on top of each other. Something like you could do monitoring on top of, uh, deployment, and that would probably reduce, uh, you could try additional training techniques to also reduce it. And then you have sort of a stack of mitigations, all of which reduce it by a little bit. And you have some empirical argument that the rate of scheming goes or covert actions or whatever proxy you're measuring goes down from some percentage to a very low percentage, um, hopefully zero in the, in the best case. But I still feel like this is a, like it requires a lot of hope that you put into this stack of interventions because it's impossible to find a strong positive argument that this stack is good or, or like safe in some sense. So like the alternative that I think we can't get around is a rigorous science of scheming where we understand which conditions in the training, uh, in, in, in the training pipeline lead to the emergence of scheming so that we can make an argument like, okay, we know that these five factors contribute to scheming and the current pipeline has some of them and we kind of can wiggle them around and we know how they, how they change and so on.
So that eventually we can make arguments along the lines of, okay, we know that the current pipeline has zero of these factors and therefore we can be quite confident that the incentive, the balance of incentives in the training pipeline does not lead to, to scheming outcomes. And I think, you know, this is ambitious. I don't think we'll get to like the, the really good version of this, but I hope that we get a better understanding of like even the most basics of like, okay, we have some sort of better theory of like why scheming or covert actions would arise to begin with. [01:38:30] Bronson: I would imagine that there are many, many, many parts of the paper where you can make kind of even relatively, um, small changes to the intervention that we, we did and get like slightly better rates. But I'm a bit less enthusiastic about, uh, attempts to get kind of slight reductions or even substantial reductions in kind of the observed rates here, because kind of, as, as we've mentioned, the kind of underlying problem that you just, you just have to solve at some point, if you want to make the case that we think these are safe is you have some understanding of how is the model thinking and some evidence that while during training, we understand this well enough to know that like the model likely isn't, uh, misaligned in some particular way. And we just don't know enough about that now. So, I mean, I think the best example of that is like everyone or the frontier labs put a lot of time and effort and energy, including like collaboration in this paper, into trying to solve these problems and getting the models to behave aligned. And you still have kind of incidents of misalignment in the wild and the models still are able to get through to, uh, deployment and reward hack or be deceptive or be sycophantic or whatever it is.
And so I think that, um, hopefully we kind of, um, build more of the case of, we really understand this and think that like, we understand how the models are, are developing what they're learning, what their goals are enough to where we can rule this out. Um, cause kind of by default, the situation we'll end up in is really powerful models. And so the kind of free variable there is how prepared are we when that happens? Um, and how much do we understand them? [01:39:56] Speaker 1: Yeah. Uh, so thanks everyone for, for listening. And if, if you're interested in more of this, uh, feel free to check out our website, apolloresearch.ai, uh, where you can also find the paper and links to the, the blog post from OpenAI, because we obviously partner with them. And so both of us have, have different blog posts and we even have a small microsite just for this paper, uh, which is called anti-scheming.ai where you can find a lot more information and so on. Um, we're also hiring, so feel free to check out our hiring page and, and apply.
