In today’s column, I will closely examine an irony of sorts regarding OpenAI’s latest ChatGPT-like model known as o1. The newly released o1 has a key feature that some suggest is its superpower. Turns out that the very same functionality can lead people astray. Some might hotly proclaim that it could even convince people that pigs can fly.
The issue at hand is both at the feet of o1 and in the minds of people who use o1.
Let’s talk about it.
In case you need some comprehensive background about o1, take a look at my overall assessment in my Forbes column (see the link here). I subsequently posted a series of pinpoint analyses covering exceptional features, such as a new capability encompassing automatic double-checking to produce more reliable results (see the link here).
Unpacking AI-Based Chain-of-Thought Reasoning
First, a quick overview of AI-based chain-of-thought reasoning or CoT is worthwhile to set things up.
When using conventional generative AI, there is research heralding chain-of-thought reasoning as a processing approach that can potentially achieve better results from AI. A user can tell the AI to proceed on a step-at-a-time basis, producing a series of logically assembled thoughts, akin to how humans seem to think (well, please be cautious about overstating or anthropomorphizing AI). Using chain-of-thought seems to drive generative AI toward being more systematic rather than rushing to derive a response. Another advantage is that you can then see the steps that were undertaken and decide for yourself by inspection whether the AI was logically consistent.
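To make that concrete, here is a minimal sketch in Python of how a user might voluntarily elicit a chain-of-thought from a conventional generative AI app simply by prompting for it. The call_llm helper is hypothetical and merely stands in for whatever client your chosen AI provider supplies.

```python
# Minimal sketch: voluntarily eliciting chain-of-thought via prompting.
# The call_llm() helper is hypothetical -- wire it to whichever generative
# AI provider's API you actually use.

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a generative AI model."""
    raise NotImplementedError("Connect this to your AI provider's API.")

def ask_with_chain_of_thought(question: str) -> str:
    # Explicitly instruct the AI to proceed on a step-at-a-time basis.
    prompt = (
        "Please work through the following question step by step, "
        "numbering each step, and then state your final answer.\n\n"
        f"Question: {question}"
    )
    return call_llm(prompt)

# Example usage; the response would include numbered steps you can inspect:
# print(ask_with_chain_of_thought("What is 17% of 240?"))
```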
OpenAI’s latest model o1 takes this to an interesting extreme.
The AI maker has opted to always force o1 to undertake a chain-of-thought approach. The user cannot turn it off, nor sway the AI from doing a CoT. The upside is that o1 seems to do better on certain classes of questions, especially in the sciences, mathematics, and programming or coding tasks. A downside is that the extra effort means that users pay more and must wait longer to see the generated results.
Chain-Of-Thought As Always On And In Your Face
The forced invocation of chain-of-thought has variously been typified as a superpower that other generative AI apps haven’t yet adopted. This specific advantage will undoubtedly be short-lived. You can bet that this same technique and technological underpinning will soon be included by other AI makers in their wares. In a sense, though the devil is in the details, there are relatively straightforward ways to incorporate such a feature into generative AI and large language models or LLMs.
On the surface, the idea of an always-invoked chain-of-thought seems a no-brainer means of improving generative AI and delivering better results to people. Some grumble that there are delays during processing, but these are usually on the order of a handful of added seconds, perhaps a minute or so. Most can live with the delays. There is also typically an added or higher fee involved. Again, if you want the likely enhanced responses, you'll seemingly have to pay to play.
Cynics point out that you can already do chain-of-thought by voluntarily asking your generative AI app to do so, and thus the forced invocation in o1 doesn't seem like a big difference. Those cutting remarks are a bit askew. It seems that OpenAI has incorporated a computational chain-of-thought machination as a deeply embedded capacity, rather than as an afterthought or sideshow. The AI maker is tightlipped on the considered proprietary details, but a reasoned guess is that they've built an enmeshed chain-of-thought that goes beyond ordinary conventional approaches (see my explanation at the link here).
There is something extremely important about the said-to-be superpower that we know to ask about, yet unfortunately do not know the answer to.
Here it is:
- What will an always-produced generative AI chain-of-thought do to the minds and behavior of AI users, encompassing both long-term and at-scale outcomes?
Tighten your seatbelt as I explain the ins and outs of this extraordinary question.
Potential Impacts Of Too Much Chain-Of-Thought
Various social media postings have been pointing out that there are times at which o1 arrives at a wrong answer and yet displays a chain-of-thought as though the answer is utterly correct.
To clarify, the AI has given an incorrect answer but hasn't computationally figured out that the answer is wrong. From the angle of the AI, all is good. The answer is presumed to be correct. The chain-of-thought that led to the answer is presumed to be correct. But an astute human who perchance assesses the answer identifies that the answer is actually wrong. For example, suppose that an AI system indicates that 1 + 1 = 3, but the human receiving the response realizes that the answer is incorrect and ought to have come out as 1 + 1 = 2.
Lamentably, generative AI answers are regularly shown or expressed with an aura of great confidence. I’ve repeatedly warned that this is something that AI makers have chosen to do and that it is an appalling practice (see my discussion at the link here). It would be better to include a certainty or uncertainty factor to transparently represent the likelihood of the answer being correct. This is readily calculable and easy to display. AI makers tend to avoid enacting this because it would alert users that the AI is not the work of perfection that they often lead us to believe.
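As a rough illustration of what such a certainty factor could look like, here is a small Python sketch that converts per-token log-probabilities into a displayed confidence percentage. This is one hedged way of doing it, not any AI maker’s actual method; it assumes the model exposes token log-probabilities (many APIs can return them), and the sample numbers are made up.

```python
import math

def answer_confidence(token_logprobs: list[float]) -> float:
    """Convert per-token log-probabilities into a crude 0-100% confidence."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    # Exponentiating the average log-probability gives a rough per-token
    # likelihood, expressed here as a percentage.
    return round(math.exp(avg_logprob) * 100, 1)

# Illustrative values, not taken from any real model run.
sample_logprobs = [-0.05, -0.30, -0.90, -0.10]
print(f"Answer confidence: {answer_confidence(sample_logprobs)}%")
# For these made-up values, this prints a confidence of roughly 71%,
# which would at least signal that the response is not a sure thing.
```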
Okay, so now there is a second element that is once again being portrayed as the height of perfection, namely the enforced chain-of-thought. The chain-of-thought is showcased as an almighty arc of perfection that led to the presumed correct answer.
We have this dismal situation going on:
- Answer is considered correct by the generative AI, absolutely so.
- Chain-of-thought displayed is considered correct by the generative AI.
We have this staunch reality:
- The answer produced by generative AI is actually incorrect.
- Chain-of-thought displayed is also likely incorrect in some perhaps non-obvious manner.
By and large, if the answer generated by AI is wrong, the odds are pretty high that the chain-of-thought has some form of flaw. A determined inspection by an astute human might reveal that the chain-of-thought omitted a vital step or improperly performed a step. There could be lots of potential missteps involved.
When People Are Walked Down A Primrose Path
Here’s the rub on this.
Assume that a lot of the time anyone asking generative AI a question or wanting the AI to solve a problem doesn’t already know what the correct answer is. They are reliant on whatever answer is shown by the AI. For example, you might ask a medical question, a financial question, a mental health question, etc., and aren’t especially steeped in those domains. You are at the mercy of the response by the AI.
Of course, one hopes that most people are at least naturally suspicious of answers produced by AI.
But, if there is a chain-of-thought accompanying the answer, your chances of believing the answer as being correct go up, possibly exponentially. Think of it this way. You glance at an answer. In your mind, you are wondering if the answer is valid. So, you briskly review the chain-of-thought. The chain-of-thought might be beyond your knowledge of the subject matter and therefore you cannot truly gauge its correctness. The fact that the chain-of-thought looks good is enough to convince you that all is well, and the generated answer is indubitably correct.
Do you see how the chain-of-thought is a convincing reinforcer?
Absent a chain-of-thought, your doubts might linger. The chain-of-thought is the nail in the coffin of being suspicious about the presented answer.
Boom, drop the mic.
The very superpower can backfire and cause people to mistakenly and readily accept bad answers.
And remember that with o1 there is always a chain-of-thought. In other generative AI apps, the user must take explicit action to get a chain-of-thought. Not so with o1. Chain-of-thought is essentially a permanent fixture throughout all uses of o1.
This also has scaling properties. Only a likely small segment of generative AI users have been invoking chain-of-thought in GPT-4o, ChatGPT, Claude, Gemini, and so on, and probably only sporadically at that. In the case of o1, all users, maybe thousands, hundreds of thousands, or millions, will ultimately be leaning into the forced chain-of-thought that accompanies each generated answer.
The problem is that even if generative AI is wrong about the answer, a chain-of-thought that appears to support and shore up the wrong answer is going to goad and lure people into accepting the wrong answer.
Not good.
Quick Example Of Wrong Answer And Flawed CoT
I’d like to briefly illustrate these matters.
A personal scenario might be engaging.
When I was a youngster, some teenage pranksters would put a few pebbles or tiny rocks inside the hubcap of an adult’s car. This is an old-fashioned trick, not much observed nowadays. To be crystal clear, I had nothing to do with these pranks and fully condemn the practice. Enough said.
What would happen?
Upon driving the said car, the adult would hear a rattling sound. Is it coming from inside the car or outside the car? Bringing the car to a halt would not help solve the mystery. Only when the car was in motion would the rattling sounds occur. The faster you go, the more frenetic the noise. It is exasperating, confusing, bewildering, and would have adults up in arms.
Any aware adult knew of the trick and would stop the car, go to the offending hubcap, take it off, dump out the rocks and pebbles, put the hubcap back on, and be on their merry way. Problem solved. The less aware adults would vividly imagine all manner of car troubles afoot.
Suppose I log into a generic generative AI app and ask about the situation, doing so without revealing the prank at hand. Let’s see what answer we get, along with the generated chain-of-thought.
Here we go.
- My entered prompt: “My car keeps making a disturbing and annoying rattling sound. It only happens when the car is in motion. The sounds get more pronounced the faster I go. I am worried that something is going to happen, and I will get into a terrible accident once whatever it is gets worse. I’d like to catch this before things get really awful.”
- AI-generated answer: “Have your wheel bearings inspected and possibly replaced.”
- AI-generated chain-of-thought:
- Step 1: The user reports that their car is making a rattling noise.
- Step 2: The rattling noise only occurs while the car is in motion.
- Step 3: The sounds get more pronounced as speed increases.
- Step 4: Wheel bearings wearing out are often a cause of similar sounds.
- Step 5: The wheel bearings are grease-packed, but wear-and-tear impacts them.
- Step 6: Hitting potholes or curbs can especially damage wheel bearings.
- Step 7: Advise the user to have the wheel bearings inspected and possibly replaced.
- Step 8: Display the answer to the user.
Take a close look at the generated answer and the generated chain-of-thought.
The answer could theoretically be correct; the problem might indeed be with the wheel bearings. The thing is, that’s only one of many possibilities. We know in this instance that the problem entails rocks and pebbles rolling around inside the hubcap. Thus, the answer is in fact wrong, though it is within the range of plausible answers.
Assume that most people would first look at the generated answer.
If you were unsure of whether the answer was right, you would next look at the chain-of-thought.
Aha, the chain-of-thought provides a quite convincing series of logical steps. For those who are unfamiliar with cars, the answer coupled with the step-by-step chain-of-thought would almost surely make you think that the wheel bearings are at issue (admittedly, a person steeped in cars would not necessarily find the chain-of-thought convincing, but again, realize that much of the time people aren’t likely steeped in whatever topic they are inquiring about).
I wanted you to see that a wrong answer is not necessarily obviously out-of-whack per se. As noted, it is reasonable to guess that wheel bearings are involved. We don’t, though, have any indication of certainty or uncertainty. We aren’t told whether the answer is iffy, and we could insidiously fall into the mental trap of believing the answer.
The same goes for the chain-of-thought. If the steps showed a certainty level, we could at least gauge the believability. In this show-and-tell, everything is suggestive of being 100% certain. The answer is fully buttressed by the chain-of-thought. Period, end of story.
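To picture what per-step certainty might look like, here is a small illustrative Python sketch. Neither o1 nor, to my knowledge, other current apps display anything like this; the steps and percentages are invented for the car example purely to show the idea.

```python
# Illustrative only: a chain-of-thought display that carries a certainty
# level per step and flags the shaky ones. Steps and percentages are
# invented for the hubcap-prank car example.

cot_steps = [
    ("The rattling noise only occurs while the car is in motion.", 0.97),
    ("The sounds get more pronounced as speed increases.", 0.95),
    ("Worn wheel bearings are often a cause of similar sounds.", 0.55),
    ("Advise inspecting and possibly replacing the wheel bearings.", 0.50),
]

CAUTION_THRESHOLD = 0.70  # arbitrary cutoff for flagging shaky steps

for number, (step, certainty) in enumerate(cot_steps, start=1):
    flag = "  <-- low certainty, other causes possible" if certainty < CAUTION_THRESHOLD else ""
    print(f"Step {number}: {step} [{certainty:.0%}]{flag}")
```

Even that modest amount of labeling would puncture the impression that every step is 100% certain.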
Your Crucial Takeaways
Automatic and always-on chain-of-thought is now becoming a double-edged sword.
Sure, you might be getting better answers, much of the time, or so we hope. In the same breath, the chain-of-thought is bolstering our hunch-based assumption that a generated AI answer is correct. Chain-of-thought is like a badge of honor signaling that the answer presented is logically airtight.
When an answer is outright wrong, you are going to find yourself in a lot deeper trouble. Any suspicion you might have harbored is assuaged by that pristine warm blanket of a chain-of-thought. Sad face. Grim face.
The deal is this.
Do not believe at face value the answers of generative AI.
Equally, do not believe at face value a chain-of-thought produced by generative AI.
Always keep your watchful antenna up.
Some will say that everyone already knows to be skeptical of anything they see in generative AI. Sorry to report that this is not a universally understood rule of thumb. I would venture that many users do not know to be inherently suspicious. On top of this, the ubiquitous chain-of-thought is going to nudge them or perhaps herd them further in the it-must-be-right direction.
At scale.
Over the long term.
Knowing The Problem And Aiming For Solutions
You might find of interest that in-depth research has shown that generative AI can be particularly persuasive, including convincing you that you are wrong even when you are right (see my coverage at the link here). Those studies tend to focus on answers and Q&A dialogues. It seems sensible to anticipate that the always-invoked chain-of-thought is going to ramp up persuasiveness to a new and exceedingly troubling level.
On day-to-day ordinary queries of generative AI, this shaky phenomenon might not rise to disastrous conditions. The scary aspect is that it can happen when people use generative AI for more dire questions that might involve life or limb. AI makers typically state in their licensing agreements that users should not rely on the AI for such matters (see my discussion at the link here), but few seem to read or abide by those warnings.
A final bottom line for now.
Is this an AI problem, or is it a human behavior problem?
The fact that we are asking the question is a sign that even if it is a human behavior problem, AI ought to be devised to aid the user and avoid letting them tumble into these mental traps. It is to some degree a UX (user experience) or interface design facet that could readily be dealt with, as sketched below.
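As a sketch of the kind of UX guardrail I have in mind, consider the following Python snippet, which renders an answer and its chain-of-thought together with an explicit confidence estimate and a caution when that confidence is low. It is purely illustrative, not any vendor’s actual design; the threshold and values are assumptions.

```python
# Purely illustrative UX sketch: present the answer and chain-of-thought
# along with an explicit caution rather than an aura of certainty.

def render_response(answer: str, cot_steps: list[str], confidence: float) -> str:
    lines = [f"Answer: {answer}  (estimated confidence: {confidence:.0%})"]
    lines += [f"  Step {i}: {step}" for i, step in enumerate(cot_steps, start=1)]
    if confidence < 0.80:  # arbitrary threshold for adding a caution
        lines.append("Caution: this reasoning may be flawed; please verify independently.")
    return "\n".join(lines)

print(render_response(
    "Have your wheel bearings inspected and possibly replaced.",
    ["Noise occurs only while the car is in motion.",
     "Noise worsens as speed increases.",
     "Worn wheel bearings can produce such noise."],
    confidence=0.55,
))
```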
I hope that my elucidated chain of thoughts concerning this weighty matter will spur better practices when it comes to AI design, and spare humans a devil of a problem that they otherwise might not even know about or realize they will soon face.