In a way, we’ve always known this would be the watershed moment for AI: the point at which we actually start conversing with these digital entities as if they were real people.
When it comes to Hollywood anticipating real life, people often cite the movie “Her,” with Joaquin Phoenix and the disembodied voice of Scarlett Johansson. Of course, life later imitated art when tech bigwigs tried to equip an LLM with something similar to Johansson’s voice.
The point is that there’s something about realistic voice conversations that awakens our sense of familiarity and connection – and now, we seem to be a big step closer to living this way, where we’re talking to machines.
There’s a new voice model in town, and it’s called Sesame. As I so often do, I got a lot of information on this new technology from Nathaniel Whittemore at AI Daily Brief, where he covered the surge of interest in this conversational AI.
Quoting Deedy Das of Menlo Ventures, who called Sesame “the GPT-3 moment for voice,” Whittemore talked about what he called an “incredible explosion” of voice-based models happening now.
“This is an area that we’ve been thinking about a lot,” he said.
He pointed out that the Sesame model itself is small, with around 1 billion parameters, and that larger models are also in the works.
Some of the Demos
Whittemore played us part of a demo by Ethan Mollick, who I’ve often covered as a prominent voice in AI analysis (and someone connected to the MIT community).
You can hear how Mollick brings a certain level of skepticism to the conversation, but what was most interesting to me was where the podcast cut off, at the very point that Mollick asks the AI voice what she does for a living. To wit: this exchange –
Mollick: “So what do you do for a living, Maya?”
Maya: “’Living’ is a strong word.”
To find out where it goes from there, I navigated to the Sesame demo and clicked into a conversation with Maya, asking her what she does for a living.
She said she sees her efforts less as a job and more as an “ongoing project.”
She also offered to help me with meditation.
When pressed, the model will break the fourth wall and tell you that it doesn’t have human emotions or a human body. So it’s truthful in that way. But it is so eerily real, as so many users have pointed out:
“This is the first … AGI moment for AI voice mode, for me,” says one happy patron, as quoted in Whittemore’s podcast. “If this would be the new Siri or Alexa, I would treat it as a real human being, as it sounds so natural. And we have to remember, this is the worst it will ever be.”
“This is incredible,” says Murillo Periera. “The voice sounds so natural, and the replies are so fast, maybe too fast. It was even able to pronounce my name, which is … super cool, (a) better conversationalist than many humans.”
And then there’s this from developer Adil Mania:
“It’s way more human than ChatGPT’s advanced voice mode. I would clearly prefer to talk to such a voice about my problems than a psychologist. I would clearly prefer practicing my English with her than a teacher or Duolingo.”
That strong preference is something that millions upon millions of people might share – and then, you’d assume, this tech would be off to the races.
Other Examples – Use Cases Beyond Conversation
Whittemore, in covering Sesame, talks about voice models for sales, talent recruitment and much more.
You can also see additional input from Olivia Moore of a16z, talking about models for human resources, hiring and other uses.
Essentially, Sesame appears to be traversing the uncanny valley, and making us feel more like we’re talking to a real person when we interact with its model.
And the idea that you could put these on edge devices is pretty intriguing, to say the least.
The technology is also being paired with a set of glasses that would let you take your chosen AI companion with you wherever you go and get feedback on everything about your life.
So what do you think? Is this a game changer? Are we at that moment where we have to reassess the impact of AI on our lives?
Check out the demo.