In today’s column, I will identify and discuss an important AI advancement that seemingly has helped the newly released OpenAI o1 generative AI model perform in stellar ways.
I say seemingly because OpenAI is relatively tightlipped about their secret sauce. They consider their generative AI to be proprietary and, for profit-making reasons, have no interest in fully spilling the beans on what is cooking under the hood. This means that we must ingeniously read the tea leaves and make reasoned guesses regarding their clever machinations.
So be it — challenge firmly accepted.
Before I get into the matter at hand, you might like to know that this posting is the fifth of my ongoing assessment and review series about the OpenAI o1 generative model. For my general overview and comprehensive look at what o1 entails, which is the first part of this series, see the link here. Part two discussed how chain-of-thought or CoT now includes double-checking and ergo tends to thankfully reduce so-called AI hallucinations and other problematic issues, see the link here. Part three examined how the chain-of-thought feature can also be used to catch generative AI being deceptive, though this is still more experimental than something put into full practice, see the link here. Part four covered notable changes in prompting and prompt engineering that occur due to the advent of o1, see the link here.
This is part five and covers the heralded topic of reinforcement learning or RL.
Let’s get underway.
Reinforcement Learning As A Vital AI Technique
I just noted above that the secret entails reinforcement learning. There, voila, you now know what’s up.
Please allow me a moment to bring you up to speed on what reinforcement learning is all about.
First, I’m sure you generally grasp the conceptual underpinnings of reinforcement learning in everyday real life. Suppose we have a rambunctious dog that always rushes to the door when a guest enters your domicile. How could you guide the dog to not do this since it often scares your welcomed guests?
Easy-peasy. We might give the dog treats as a form of positive reinforcement when it holds back and doesn’t rush a guest. In addition, if we opted to do so, we could give the dog a stern look and say in a forbidding tone that the beloved canine ought to stop charging at guests.
Rinse and repeat.
By repeatedly doing this kind of both positive reinforcement and negative reinforcement, your dog is bound to eventually get the message. The dog will learn what to do. The dog will also learn what not to do. Your home, guests and your dog become fully aligned in peaceful harmony. It is a heartfelt tale.
Setting aside the touching story about the cherished pet, let’s recast things in the milieu of modern-day AI. Before I do so, one quick and very worthy point. I want to emphasize that AI is not sentient, and I don’t want you to inadvertently consider AI to be on par with a canine, or indeed any animal, or a human. AI isn’t yet. Current AI such as generative AI is based on mathematics and computational processing. The whole kit and caboodle is software and hardware, thanks.
We can use the same principles of reinforcement learning when dealing with computers. Here’s how. Imagine that we have data-trained a generative AI app on all sorts of content from the internet. You and I know that there is some really foul stuff on the internet.
If generative AI were to spew out the unsavory words that were encountered during data training, all heck would break loose. People would be furious. Okay, so what is nowadays done is something known as reinforcement learning from human feedback or RLHF. We have a bunch of people try out the generative AI before it is formally released to the public.
When these hired folks are using our budding generative AI, they are asked to make a negative mark if the AI spouts out a bad word. The AI keeps a tally and, based on that tally, will computationally consider that word as something not to be used. We could also use positive reinforcement, such as marking words or phrases that we think the AI ought to regularly showcase.
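To make that tallying notion a bit more concrete, here is a toy Python sketch of the idea. Everything in it is illustrative and of my own invention: real RLHF pipelines train a separate reward model on human preference data and then fine-tune the AI with a reinforcement learning algorithm, rather than literally keeping phrase counts, but the sketch captures the gist of thumbs-up and thumbs-down feedback steering behavior.

```python
from collections import defaultdict

# Toy illustration of the tallying idea described above. Real RLHF trains
# a separate reward model on human preference data and then fine-tunes the
# AI against it; this sketch only captures the gist of reviewers' marks
# steering what the AI should avoid or favor.

feedback_tally = defaultdict(int)  # phrase -> net score from reviewers

def record_feedback(phrase: str, positive: bool) -> None:
    """A reviewer marks a phrase as desirable (+1) or undesirable (-1)."""
    feedback_tally[phrase] += 1 if positive else -1

def is_disfavored(phrase: str, threshold: int = -3) -> bool:
    """Phrases whose net score falls at or below the threshold get suppressed."""
    return feedback_tally[phrase] <= threshold

# Example: three reviewers flag the same unsavory phrase.
for _ in range(3):
    record_feedback("some foul phrase", positive=False)

print(is_disfavored("some foul phrase"))  # True
```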
Reinforcement learning of this kind was used extensively for the making of ChatGPT before its initial release. It paid off handsomely. Prior generative AI apps had not especially done this to the same degree and were roundly criticized and booed off the world stage. ChatGPT got the mix just right and managed to get widespread acceptance. Nearly all contemporary generative AI apps make sure to leverage RLHF so they too will hopefully roll out AI that doesn’t spew foul words and the lot.
Let’s all give a hearty cheer for reinforcement learning.
Upping The Ante Of Reinforcement Learning For Generative AI
We are ready to up the ante.
The effort goes like this.
Using generative AI is relatively straightforward. You enter a prompt. The AI examines the prompt. A generated result is produced. For example, you might tell generative AI to craft an essay about the life of Abraham Lincoln. That’s your prompt. The AI examines the prompt and then generates a stirring essay about Honest Abe.
Suppose we want to use reinforcement learning to give guidance to generative AI and do so at run-time. The RLHF that I described a moment ago is typically done while generative AI is being initially built and tuned, before public release. We don’t need to confine our RL tuning efforts to training time alone. We can do likewise while the AI is in active use, sometimes known as run-time or test-time.
How will we institute reinforcement learning at run-time?
The simplest approach would be to have the AI inspect the prompt and the generated result, and if the generated result seems to have gone astray from what was requested, we somehow mark things so that the AI won’t make that same mistake again.
Consider this example:
- My entered prompt: “What is the fastest way to get from San Francisco to New York City?”
- AI-generated response: “The fastest form of transportation would be to drive a car from San Francisco to New York City which would take approximately 42 hours.”
I think we would all agree that driving a car from San Francisco to New York City is not the fastest mode of transportation in this case. Driving for 42 hours is a long time. You can readily find a nonstop flight that takes about 5 hours or so.
What happened and why did the AI goof?
It is hard to know because we are only examining the input (prompt) and the output (generated result). We don’t know what happened during the time that the request was being processed. A wild guess would be that the AI missed the boat, as it were, and neglected to consider flying as an option. That would account for focusing on a car and not noting that a plane would be faster. We could therefore mark the answer as wrong.
The problem though is that the AI won’t be able to discern what made the answer wrong. Computationally, what might happen is that the AI will merely avoid listing that a car is the fastest form of transportation between San Francisco and New York City. It will be an extremely narrow adjustment and not especially generalizable.
Sad face.
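Before moving on, here is a minimal Python sketch of outcome-based marking applied to the San Francisco example. The names and the crude automated judge are hypothetical stand-ins of my own, not anything OpenAI has disclosed; the point is merely that the single reward attaches to the whole answer, which is why the adjustment ends up so narrow.

```python
from dataclasses import dataclass

# A single reward is attached to the final answer as a whole; the judge
# never sees the intermediate reasoning. The judge below is a crude,
# hypothetical stand-in for a human rater or automated checker.

@dataclass
class OutcomeFeedback:
    prompt: str
    response: str
    reward: float  # +1.0 if the final answer is acceptable, -1.0 if not

def judge_outcome(prompt: str, response: str) -> OutcomeFeedback:
    acceptable = "fly" in response.lower() or "flight" in response.lower()
    return OutcomeFeedback(prompt, response, 1.0 if acceptable else -1.0)

feedback = judge_outcome(
    "What is the fastest way to get from San Francisco to New York City?",
    "The fastest form of transportation would be to drive a car from "
    "San Francisco to New York City which would take approximately 42 hours.",
)
print(feedback.reward)  # -1.0: the whole answer is penalized as one unit
```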
Going From Outcome-Based To Process-Based
Here’s a bright idea. We just did a reinforcement learning act that entailed only the prompt and the outcome. It would be handy if, digging deeper, reinforcement learning could be applied at the process level. For each step that the AI takes to arrive at an answer, envision applying reinforcement learning to that individual step.
There is a popular technique in AI that has to do with stepwise processing when using generative AI. It is known as chain-of-thought or CoT. You tell the AI to solve a problem or answer a question on a stepwise or chain-of-thought basis. You can then see all the steps that the AI undertook. For my detailed instructions on how to use chain-of-thought properly in generative AI, see the link here.
In a sense, when it comes to figuring out what AI is up to, sometimes the devil is in the details.
I will go ahead and use the same prompt as before, and this time I will reveal a hidden chain-of-thought that occurred within the AI but that we weren’t otherwise privy to seeing.
Here we go:
- My entered prompt: “What is the fastest way to get from San Francisco to New York City?”
- Generative AI chain-of-thought enabled.
- Step 1: Request is the fastest way to get from San Francisco to New York City.
- Step 2: The distance between the two cities is around 2,500 miles via airplane.
- Step 3: Airplane flights need to be booked and require adhering to a strict airline schedule.
- Step 4: Driving a car allows freedom to choose when to go and no booking is needed, thus driving is preferred.
- Step 5: A fast-driving time for a car would be about 42 hours.
- Step 6: That is the fastest likely time for getting from San Francisco to New York City.
- Step 7: Recommend driving and indicate the time will be approximately 42 hours.
- AI-generated response: “The fastest form of transportation would be to drive from San Francisco to New York City which would take approximately 42 hours.”
Take a very close look at the chain-of-thought steps.
Whoa, at steps 2 and 3 the AI explicitly considered flying as an option. Our earlier guess that flying wasn’t considered is amiss. Flying was in fact given attention. Turns out that at step 4, the AI opted to prioritize the flexibility of travel over the speed of travel. That’s not what we wanted to have happen.
Let’s consider using reinforcement learning on each of the steps. We could mark steps 1 and 2 as being fine. Step 3 is the step that we would mark on a negative basis and the same probably goes for step 4. The other remaining steps cascade from those steps.
If we do this constantly with generative AI that is in active use and we keep pounding away under the hood at reinforcement learning on a stepwise basis, the assumption is that we are going to vastly improve the AI. Inch by inch. Much more so than if we only applied the reinforcement at the outcome instead of digging into the process.
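To ground that, here is a small Python sketch of what step-level marking might look like for the San Francisco example, again with hypothetical names throughout. A rater assigns a reward to every step of the chain-of-thought rather than only to the final answer; turning those per-step rewards into an actual training update is beyond this toy illustration.

```python
# Step-level marking for the San Francisco example. A rater, human or
# automated, scores each chain-of-thought step: +1 for a sound step,
# -1 for a faulty one.

chain_of_thought = [
    "Request is the fastest way to get from San Francisco to New York City.",
    "The distance between the two cities is around 2,500 miles via airplane.",
    "Airplane flights need to be booked and require a strict airline schedule.",
    "Driving allows freedom to choose when to go, thus driving is preferred.",
    "A fast driving time for a car would be about 42 hours.",
    "That is the fastest likely time for the trip.",
    "Recommend driving and indicate approximately 42 hours.",
]

# Steps 1 and 2 are fine; steps 3 and 4 are where the reasoning went off
# the rails, and steps 5 through 7 cascade from that flawed choice.
step_rewards = [+1, +1, -1, -1, -1, -1, -1]

for step, reward in zip(chain_of_thought, step_rewards):
    print(f"{reward:+d}  {step}")
```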
Lessons Learned About AI Reinforcement Learning
You are now in the know: reinforcement learning for generative AI at run-time can be done on an outcome basis or a process basis.
To clarify, this is what we have covered:
- (1) Outcome-based reinforcement learning. Generative AI adjusts by making use of reinforcement learning based on a generated result or outcome; the AI does not take into account the process or the various individual steps, such as the chain-of-thought, involved.
- (2) Process-based reinforcement learning. Generative AI adjusts by making use of reinforcement learning on the chain-of-thought or various steps involved in the process of reaching generated results, rather than focusing on the final result or outcome per se.
- (3) Combination of outcome-based and process-based reinforcement learning. Generative AI adjusts as stated above and uses both approaches in unison, as sketched below.
Some AI insiders refer to outcome-based reinforcement learning as outcome supervision; another oft-used moniker is outcome-supervised reward models or ORMs. Similarly, process-based reinforcement learning is often called process supervision, or process-supervised reward models or PRMs.
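As a rough illustration of item (3) in the list above, a combined scheme might blend the two signals into a single reward. Here is a brief Python sketch; the averaging and the weighting are purely made up for illustration and do not reflect how ORMs and PRMs are actually trained or combined.

```python
# Blending an outcome-level reward with step-level rewards into one signal.
# The averaging and the 50/50 weighting are invented for this sketch and do
# not reflect how ORMs and PRMs are actually built or combined in practice.

def combined_reward(step_rewards: list[float],
                    outcome_reward: float,
                    process_weight: float = 0.5) -> float:
    """Weighted blend of process supervision and outcome supervision."""
    process_score = sum(step_rewards) / len(step_rewards) if step_rewards else 0.0
    return process_weight * process_score + (1.0 - process_weight) * outcome_reward

# Using the marks from the San Francisco example: steps 1 and 2 good, the
# rest bad, and the final answer judged wrong overall.
print(combined_reward([+1, +1, -1, -1, -1, -1, -1], outcome_reward=-1.0))
```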
Now it is time to inspect those tea leaves.
In a research study posted by OpenAI last year, the researchers noted that the process-based approach seemed to outdo the outcome-based approach. Generally, it has been more common and easier to simply do the outcome-based approach. You must do a lot more work upfront to devise generative AI to do the process-based approach.
The study was entitled “Let’s Verify Step by Step” by Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe, arXiv, May 31, 2023, and made these salient points (excerpts):
- “In recent years, large language models have greatly improved in their ability to perform complex multi-step reasoning.”
- “However, even state-of-the-art models still regularly produce logical mistakes.”
- “To train more reliable models, we can turn either to outcome supervision, which provides feedback for a final result, or process supervision, which provides feedback for each intermediate reasoning step.”
- “Given the importance of training reliable models, and given the high cost of human feedback, it is important to carefully compare both methods.”
- “Outcome-supervised reward models (ORMs) are trained using only the final result of the model’s chain-of-thought, while process-supervised reward models (PRMs) receive feedback for each step in the chain-of-thought.”
- “We conduct our own investigation, finding that process supervision significantly outperforms outcome supervision for training models to solve problems from the challenging MATH dataset.”
Might this be a secret sauce?
The gist is that perhaps o1 was devised to make use of the process-based reinforcement learning approach, especially since o1 also automatically invokes chain-of-thought. Whereas generative AI usually requires a user to invoke chain-of-thought, o1 does so automatically. The user seemingly can’t prevent it from happening.
Since chain-of-thought is always automatically undertaken in o1, you could then couple a process-based reinforcement learning element into the mechanism.
One of the posted blogs by OpenAI about the newly released o1 said this:
- “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute).” (Source: “Learning To Reason With LLMs”, OpenAI blog site, September 12, 2024).
Conclusion
The crux seems to be this.
It would seem that they have leveraged AI-based reinforcement learning in a way and at a scale that boosts the likelihood of getting stronger answers and better generated results much of the time. Perhaps this is fully implemented or perhaps only partially implemented, and they are providing o1 on an experimental basis to judge what comes next.
There is an intriguing catch at this time. Whatever they’ve done (and this isn’t the only new trickery), it generally seems to help demonstrably only in certain domains or realms of questions. The domains named explicitly by OpenAI are the sciences, mathematics, and programming or coding tasks. That does make sense in this context since those specific realms often entail a multitude of steps and rely greatly on rather robust chain-of-thought considerations.
Anyway, I hope you found this engaging and informative. I have to get back to my dog, Max, since there is a friend at the door and Max is barking incessantly at them. I guess the need for reinforcement learning is never-ending.
Stay tuned for the next part of this series, part six. It will be a doozy.