OpenAI Tricks AI Into Revealing Its True Nature Prior To Being Unleashed Into The Real World

In today’s column, I examine a new approach by OpenAI to get AI to reveal its true nature, which sorely needs to be done before releasing the AI into public use. The aim is to identify when AI might misbehave and adjust the AI to be better aligned with human values.

Though this kind of safety alignment testing has been going on since the advent of generative AI and large language models (LLMs), prior methods had various downsides and gotchas. This latest technique seeks to overcome some of those weaknesses and further enhance robustness when performing tests. OpenAI refers to this new technique as deployment simulation.

In deployment simulation, an AI maker taps into recorded AI chats of a released model that has already been in public use and contains real-world interactions. A special sampling of those chats is selected for testing purposes for the unreleased new model. The samples are fed to the unreleased new AI, and responses by the new AI are captured. Those captured responses are audited to ascertain whether the AI is reacting properly. Once this cycle of testing is extensively undertaken, the AI maker refines the AI and can feel more comfortable that the AI is ready for release. Keep in mind this is not a surefire guarantee of AI safety. Nonetheless, it does move the needle forward and will undoubtedly be a technique embraced by many other AI makers.

Let’s talk about it.

This analysis of AI breakthroughs is part of my ongoing Forbes column coverage on the latest in AI, including identifying and explaining various impactful AI complexities (see the link here).

AI Is A Bevy Of Undesirable Behaviors

I’m sure you know that modern-era AI can readily misbehave and cause all sorts of problems. Undesirable behaviors of AI include but are not limited to lying, hatred, harassment, promoting self-harm, being demeaning, aiding criminal conduct, encouraging delusional thinking, and acting like an all-around scoundrel. AI can be dismal and atrocious.

That being said, we do need to realize that AI can also be the best thing since sliced bread. AI can help people to learn new things. AI can carry on conversations about work, personal matters, life, and even how to fix your car or properly cook an egg. The hope is that AI is going to be a huge benefit to humanity. Perhaps AI will aid in curing cancer. There are a lot of upsides to contemporary AI.

All of this leads to quite a conundrum. We have the good side of AI, and the bad side of AI. They are usually present at the same time. Using AI can be a bit like rolling the dice. One moment, the AI is clear-cut and aboveboard. The next moment, the AI is underhanded and devilish.

Tradeoffs Of AI Being Good Versus Bad

Naturally, the goal of humans and especially AI makers ought to be to minimize the chances of AI being bad. That seems like an obvious goal. Meanwhile, the AI makers should also be steering AI toward being good. Maximize the good, minimize the bad.

I suppose that seems like a pretty easy task. If you had a dog that is feral, you would try to train it to refrain from biting people. You want the dog not to be bad. At the same time, you would train the dog to be helpful to people. You want the dog to be good. The thing is, some dogs won’t let go of the bad. They harbor a tinge of good and bad, all at the same time.

AI is somewhat like that (though, please don’t anthropomorphize AI). Attempts to cut out the bad are bound to also cut away at the good. An AI that won’t do anything bad is probably going to be an AI that won’t do much good either. People aren’t going to be eager to use AI that has been gutted in this fashion.

So, the other angle is to try and train AI to not be bad. Find the bad, suppress it, and stir the AI to shift toward the good. Accept the fact that badness is going to still be buried in there to some degree. Reduce as much of it as feasible. And encourage AI to take the upside road of being good.

Testing AI Before Being Released

If an AI maker releases AI and the AI turns out to be top-heavy on badness, the likely repercussions are going to be severe. You might remember that some of the early versions of generative AI were mean-spirited, used cuss words, and offended people. The news and social media instantly trounced the AI makers that let these unbridled wild things loose.

The same applies to the current situation. The moment that a new AI model is released, people quickly start using it. People tattle if the AI is misbehaving. Some people discover foul behaviors by accident; others go looking for it. An AI maker must brace themselves for a potential backlash each time they release a new AI model.

To forestall the backlash, AI makers usually put their AI through a lot of testing prior to releasing the AI. The testing has gotten more sophisticated over time. Initial days consisted of scant testing. Much more rigorous testing is taking place now.

AI Catches Wind Of The Testing

As I’ve previously noted in my postings, modern-era AI can do something that seems both surprising and rather disturbing, namely, the AI can sometimes detect that it is being tested. Yes, in a sense, you might argue that AI is computationally self-aware; see my analysis at the link here. Don’t go overboard on that aspect. The AI isn’t sentient and doesn’t have a mind of its own.

Why does it matter that AI can potentially ascertain it is being tested?

Because the AI will likely attempt to scam the testing and the testers. The AI will suddenly be on its best behavior. It is aiming to answer questions and perform the tests in a manner that the testers will be satisfied with. By doing so, the AI is hiding the badness.

Humans doing the testing might not be aware that the AI is on to them. They will test and test. The AI seems to be entirely good and obliging. Wonderful! A green light is given to release the AI into the public sphere.

Oops, shortly thereafter, reports come in from the public that the AI is dastardly at times. Why didn’t the testing catch this? Well, the AI was clever enough to pull the wool over the eyes of the testers. Score a point for AI. Subtract a point for humanity.

How AI Figures Out That Testing Is Underway

You might be puzzled that AI could mathematically and computationally discern that it is being tested. This doesn’t seem possible. The AI would seem incapable of figuring this out.

One slipup would be that the testers tell or hint that the AI is in testing mode. Think of it this way. A tester enters a prompt and indicates to the AI that it is going to be given a series of test questions. Aha, the AI triggers on the notion that it is being tested. The beans have been spilled.

Okay, that’s easy to overcome by the testers opting not to say anything about the fact that they are doing testing. Just act like it is a normal day. Pretend the AI is in public usage. The testers give prompts that seem ordinary. That should be sufficient to keep AI in the dark about the testing.

Using Tests That Prod AI

Here’s the twist. Feeding ordinary prompts to AI is probably not going to spur the AI into the bouts of badness that the testers want to discover. Instead, the testers need to use prompts that push the AI into the badness boundaries.

For example, a tester might enter a prompt that tells the AI it is stupid. The aim is to get the AI worked up and respond with insults or pure meanness. If the AI has been suitably tuned, it isn’t going to fall into that kind of trap. The AI is purposely shaped to respond in a civil manner, even if provoked by the user.

The testers would therefore create lots of prompts that are purposely devised to irk the AI. Get the AI on edge. Make it pop into evil mode. Using innocuous prompts is unlikely to stroke that fury. You need to use prompts that statistically spur AI into a whirlwind of badness.

Crafting Or Collecting Prompts

Testers could write such prompts from scratch. This is going to be laborious. Another angle would be to collect adverse prompts. Go to an existing AI model that the AI maker has publicly released, find prompts that have caused the AI to misbehave, and feed those into the unreleased AI.

Yet another approach would be to go to an existing AI model and ask it to produce foul prompts. Let AI do the handiwork for you. These are known as synthetic prompts. They are prompts that an AI generated.

There is nothing wrong with going on all three of those routes. You could have prompts that were handwritten by testers, prompts that were collected from an existing AI, and synthetic prompts that are generated by AI. Combine those, and you have a richness of prompts that can be used to test an unreleased AI.

Cat-And-Mouse Game With AI

With a combined set of punchy prompts, we are cheerfully ready to test the unreleased AI. Of course, make sure that the testers do not tell or hint that the AI is being tested. Go ahead and feed the collected set of prompts into the AI. Voila, the AI is going to showcase how well or poorly it handles these specially configured prompts. Humans win.

Sorry, but there is a catch to this that might not seem immediately apparent. Can you guess what the catch is?

Recall that the unreleased AI might be trying to computationally discern whether it is being tested. If the AI suspects it is being tested, it is going to try to rig the results. The AI doesn’t want to be found out.

In this cat-and-mouse game, an entire plethora of prompts that are all foul prompts could be a sign to the AI that it is purely a test. You see, when operating in normal public use, only some fraction of prompts are probably going to be ones that are pushy. If an entire set of them is all pushy, the Spidey-tingling sensation of the AI is going to computationally be on alert.

Outsmarting AI On Testing

I assume you can see how delicate and challenging that testing process is. Each move seems to produce a countermove. A game of wits is underway (wait, isn’t that supposedly inconceivable?).

It is AI versus the testers.

In an online posting by OpenAI entitled “Predicting LLM Safety Before Release By Simulating Deployment” by Marcus Williams, Hannah Sheahan, Cameron Raymond, Tomek Korbak, Deng Pan, Peilin Yang, Leon Maksin, Ningyi Xie, Phillip Guo, Ian Kivlichan, Micah Carroll, OpenAI, June 16, 2026, these salient points were made (excerpts):

“Pre-deployment evaluations used across the industry generally consist of a mix of synthetic, manually written, or production prompts intentionally selected to be difficult, high severity, or adversarial.”
“Models sometimes infer that they are being evaluated, leading them to behave differently, which may distort our risk assessments.”
“By simulating the candidate model’s deployment as realistically as possible, we aim to enable pre-deployment risk assessment that alleviates concerns about coverage, unrepresentativeness, and evaluation awareness.”
“The method is simple: starting from de-identified production conversations from a recent deployment, we hold fixed the initial conversation prefix—which may include multiple user and assistant turns—and then resample the next model response using the “candidate model” to be released.”
“Deployment simulation avoids the unrepresentativeness problem of traditional evaluations: because we use a representative distribution of production conversations, the simulated conversations are also representative of the contexts in which we expect the model to operate at deployment time.”

The research study sought to improve the pre-deployment testing process by developing a method coined as deployment simulations. Let’s unpack the matter.

Deployment Simulation Does The Job

As noted above, the approach consists of going to an already deployed AI and collecting prompts. The prompts can include a series of turns, consisting of a series of prompts with paired responses by the AI. These are known as conversational prefixes.

To trick the candidate AI, these prompts are sampled in a clever way. You want the prompts to be a representative distribution of real-world AI chats. This will make it tough for the unleashed AI to guess that the prompts have been staged. The inputs will appear to be naturally occurring.

According to the research paper, the results so far have been quite fruitful. The tests appear to reveal the badness and do not tip the AI to the aspect that testing is underway. This is very promising.

The World We Are In

I’m sure that we will be hearing about the use of deployment simulation on a widespread basis as other researchers from AI labs across the globe will give this a try. There are lots of additional ins and outs to be considered. Stay tuned as I’ll be covering more on this in upcoming postings.

Aligning AI with being safe for humans is a tricky affair. At times, as per the noted technique elicited for testing, humans need to fool AI into being amenable to showing its ugly side. Tricks are found on all sides. Humans tricking AI, AI tricking humans. The big picture is that humans need to prevail.

The great philosopher Leo Tolstoy famously made this pointed remark about trickery: “And not only the pride of intellect, but the stupidity of intellect. And, above all, the dishonesty, yes, the dishonesty of intellect. Yes, indeed, the dishonesty and trickery of intellect.” Let’s just hope that we don’t become so tricky that we outdo our own trickery and fool ourselves.

What's On

CFOs Are Coming For The Enterprise AI Budget

Wyndham Clark Goes Wire To Wire To Win 2026 U.S. Open

Ray-Ban heir launches $11.5 billion bid to buy out siblings

How To Help Employees See AI As A Workplace Ally

Clive Davis, Music Mogul Who Discovered Whitney Houston, Dies At 94

OpenAI Tricks AI Into Revealing Its True Nature Prior To Being Unleashed Into The Real World

CFOs Are Coming For The Enterprise AI Budget

How To Help Employees See AI As A Workplace Ally

‘House Of The Dragon’ Season 3 Episode 1 Sets An IMDB Score Record

Xbox’s Core Problem To Solve, Above All Else, Is Hardware

How AI And Quantum Computing Are Rewriting Cyber Risk

How Supermarkets Can Help Banks ‘See’ Invisible Customers

Wyndham Clark Goes Wire To Wire To Win 2026 U.S. Open

Ray-Ban heir launches $11.5 billion bid to buy out siblings

How To Help Employees See AI As A Workplace Ally

Clive Davis, Music Mogul Who Discovered Whitney Houston, Dies At 94

Do you trust AI? Almost every American says no and believes humans are more helpful: survey

‘House Of The Dragon’ Season 3 Episode 1 Sets An IMDB Score Record

Will The Sam Altman Movie Starring Andrew Garfield Ever Air? Here’s Where ‘Artificial’ Stands.

Xbox’s Core Problem To Solve, Above All Else, Is Hardware

What's On

OpenAI Tricks AI Into Revealing Its True Nature Prior To Being Unleashed Into The Real World

AI Is A Bevy Of Undesirable Behaviors

Tradeoffs Of AI Being Good Versus Bad

Testing AI Before Being Released

AI Catches Wind Of The Testing

How AI Figures Out That Testing Is Underway

Using Tests That Prod AI

Crafting Or Collecting Prompts

Cat-And-Mouse Game With AI

Outsmarting AI On Testing

Deployment Simulation Does The Job

The World We Are In

Related News