Putting The Senses In AI

It’s no secret that smart wearables are becoming a big industry, and in that context, the “awareness” of hardware through sensory apparatus is a big factor. Machines that see (in their own ways) and experience the world around them use sensors like cameras, along with other analytical tools, to feed data into the LLM, or brain, of the system.

I wrote last week about a doubling in the smart glasses sector last year, making that a bigger part of tech retail. Then there are all of the robotic applications in business, with automation making greater inroads into production, service jobs, and even things like janitorial work.

“In a benign scenario, probably none of us will have a job,” said Elon Musk, the richest man in the world, according to reporting by Eric Revell at Fox Business. “There will be universal high income – and not universal basic income – universal high income. There’ll be no shortage of goods or services.”

That’s a pretty rosy projection, but the idea is not lost on many with front row seats to this wave of advancement: that in the end, AGI will become so capable that it can do almost any rote task with a great degree of success.

Analyzing Progress

There’s also the question of how we get there. I saw a panel at this year’s April Imagination in Action event at MIT, where a group of accomplished people discussed how sensory AI is flourishing, and what models businesses use to push the envelope. (Disclaimer: April’s IIA event is an annual conference that I help to facilitate.)

In moderating the panel, our own Paul Liang of MIT’s Media Lab asked the group where they think the common approach to AI has most gone off the rails.

“The thing I’m actually most worried about is that as AI integrates with all the sensors that we have in our lives, from our watches to our rings to pens to glasses, it will know everything about us,” said Alvin Graylin of Stanford, “and if that data is not controlled by the user, we will at some point become controlled by whoever controls the platforms that owns that data, and I think our loss of agency is one of the biggest risks that we have as humans, as AI becomes more prevalent and as data becomes more available.”

Cinnamon Sipper, CEO of Godela, had this to say about the path to advancing AI:

“I don’t believe that the type of output that looks like, you know, general intelligence and physics reasoning will come about by scaling any one model the same way,” Sipper said. “I think, instead, being able to tackle complex physics problem-solving, bringing true physical reasoning into different AI models or different systems, will require a little bit more of an orchestration of different models, as opposed to any just one master general model.”

James Le talked about how things work at his company, TwelveLabs, where he is Head of Developer Experience. He pointed out that many firms rely on big data and supervised learning, an approach that is more mechanical, less agile, and less focused on teaching the model to understand.

“Our focus as a company is to take the other direction,” he said, “training the video natively on a lot of video content, building these communities that can understand temporal dimension, how spaces relate to each other through time. To that point about orchestration, I think it’s also super-important to view kind of a corpus level, video orchestration that can think about concept objects, activities inside the video frame, how they relate to each other, and then, when you ask questions about any specific entities or activities, you can actually derive the context graph, the knowledge graph.”

Domain Expertise

In going over some of these more sophisticated tacks on AI progress, the panel kept returning to the question of whether to lean more toward explainable AI, or something different.

Sipper mentioned the drawbacks of “black box” systems, suggesting that “pouring a bunch of data into a model, and hoping that it solves all sorts of problems, is a little bit of an intractable trade-off in value and investment right now.”

Le discussed combining data labeling, which is a big business, with domain-specific modalities, and Graylin expanded on that, noting the constraints of using video to teach robots:

“When you look at just using video, it’s not enough fidelity of information to train robots to do activities,” Graylin said, “because they don’t have pressure data, they don’t have directional data, they don’t have details.”

He continued:

“There’s a lot of occlusion that happens when things are being done, when things are getting complex, and also very fine-grained positional data of objects and body parts and so forth, so if you’re looking at just training systems with a lot of video, it still won’t solve those kinds of problems. Having a combination of well-labeled data with alternative multimodal sensing, I think that allows you to then create the more sophisticated learning that you’re talking about.”

Le elaborated:

“If you train with language first, you acquire the bias of the text modality,” he said, “and in our domain, for example, the temporal motion part gets extremely important, and adding on video as an afterthought is not effective.”

The Big Brain

Some of the discussion also moved toward comparing smart AI to humans.

“If we learn from biology, humans learn about the physical world before we learn language,” Graylin said, “so it would actually make sense to do a multimodal model of learning, because if we’re modeling the brain, then it would make a lot of sense to learn from all modes at the same time. In fact, if you look at children who learn multiple languages, they may be a little bit slower in the beginning, but they’ll automatically be able to translate between all these languages eventually.”

“These arguments are great,” Liang noted, “but empirically, we don’t see the evidence that large scale natively multimodal training outperforms first training language models, and only then stapling other modalities on as an afterthought. So, do you think something needs to change in maybe the architectures of these models, the way that they are trained, the way that data is collected and presented for these models?”

In response, Graylin pointed to self-driving technologies, where earlier efforts started out with a lot of labeled data and better LLMs later brought higher-level inference and processing, and said that this looks like progress.

Sipper talked about how her company trains with scalar field outputs of simulation data, and the meshes of objects.

Privacy and Agency

As panelists discussed the necessity for privacy and user agency, Graylin argued for systems that keep data on the device by default and share it only when the user chooses to.

“This has to be the default,” he said, “that a system does not share beyond the device that data is collected in, and it’s only serving the user. If the user would like to share that with different platforms, then it makes sense, but if it’s automatically being captured by platforms, or the device manufacturers, or an advertising vendor, then there’s going to be significant privacy backlash.”

Le, again, presented this through the lens of how his company works:

“We think about government, national security, defense use cases, and in that industry, privacy and security are even more prominent.”

“There is such a strong demand for on-prem solutions,” Sipper said, “that a lot of people haven’t really figured out how that is compatible with an increasingly cloud-based infrastructure, and wanting to own different parts of the stack, and so I think there are very interesting business models evolving. I’m sure there are more philosophical, grand big questions that will come about.”

“How do we keep people from allowing machines to direct everything?” Graylin asked. “Because when we start to have everything being sensed, then the machine will just give you the answers, and it will just be automatic, and more and more we will rely on machines to tell us what to do, where to go. We’re already doing that today when we drive, but we’re going to do that to all aspects of our lives.”

More Senses

“I am really excited about the sense of touch, and the sense of smell,” Liang said in conclusion. “I think some of you already alluded to this, that we need AI that understands the physical world, and for it to understand the physical world, it must feel and interact with objects like people can. So, how do you build really good sensors to capture the sense of touch? How do you build sensors that capture smells of different objects, and use that as a way of recognizing whether something is good or bad, or whether something is dangerous, right? These are all very interesting questions that aim to extend our human senses and implant them into AI machines. We’ve built systems that can transmit smells over digital mediums, and have somebody else wearing something, and recreate that smell. There’s lots of senses beyond language, obviously video, audio, that are part of the human experience, and are worth investigating.”

This was such an interesting foray into what people are doing with AI now. Touch? Taste? Smell?

What do you think? Drop me a comment and let me know.
