It’s that time of year again, when people publish their top-10 or top-20 lists of what to expect in the year ahead. As usual, rather than pile on with another list, I’m limiting my contribution to one compelling (or half-baked) trend for the year ahead.

In the year ahead, big data will be back. Data is becoming more than the “new oil”; it is becoming the new money. Big data became a big deal about a decade ago, when analytics hit center stage as the path to business success, then faded from view once big data was suddenly everywhere, rendering the term almost meaningless.

Over the past two years, amid all the excitement about generative AI, it almost seemed as if data — or attention to its quality and trustworthiness — took a back seat to the jazzed-up illustrations and seemingly profound insights generative AI was delivering. Now, with genAI so critical to business, people are realizing that their AI foundations are built on piles of very loose sand.

When AI “hallucinates,” it’s not because its mind is wandering; it has no mind to speak of. It’s simply running probabilities, grabbing the next piece of available and related data to complete a narrative.
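To make that concrete, here is a toy sketch of that next-token probability game. The vocabulary and probabilities below are invented for illustration; real models work over tens of thousands of tokens, but the mechanic is the same: sample by likelihood, with no notion of truth.

```python
import random

# Toy next-token distribution a language model might assign after the
# prompt "The capital of Australia is" -- the numbers are invented for
# illustration, not taken from any real model.
next_token_probs = {
    "Canberra":  0.55,  # correct
    "Sydney":    0.30,  # plausible but wrong: a "hallucination" waiting to happen
    "Melbourne": 0.10,
    "Auckland":  0.05,
}

def sample_next_token(probs: dict[str, float]) -> str:
    """Pick a token in proportion to its probability, with no notion of truth."""
    tokens, weights = zip(*probs.items())
    return random.choices(tokens, weights=weights, k=1)[0]

print(sample_next_token(next_token_probs))
# Roughly 45% of the time this prints something other than "Canberra":
# the model isn't lying, it's just sampling from a distribution.
```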

Now, there’s even concern that we’re starting to run out of data to feed the machines. “Most of the world’s publicly available data — whether it is obtained legally or not — has been exhausted,” said Andy Thurai, senior analyst with Constellation Research. When will the madness end, right?

So, yes, data will very much be back in the spotlight in 2025, because we need it, lots of it, and it has to be really good and really timely.

“Data was all the rage during the 2010s, the age of so-called big data,” said Tony Baer, principal at dbInsight. “As cloud scale made big data the norm, we began taking data, and managing lots of it, for granted. Then genAI burst on the landscape last year, and the venture funds began chasing AI with a vengeance.”

Big data and AI “have a synergistic relationship,” according to a report from Qlik. “Big data analytics leverages AI for better data analysis. In turn, AI requires a massive scale of data to learn and improve decision-making processes.”

Big data will either make or break AI. “While AI is always about data that models were trained and tested on, it is becoming even more clear data is the differentiator with winning AIs,” said Thurai.

At least 86% of executives report data-related barriers to AI, such as difficulty gaining meaningful insights and problems with real-time data access, according to a Presidio survey of 1,000 IT executives. Half believe they plunged into genAI before they were fully prepared.

The venture capitalist community remains hot on AI, “but guess what? It’s going to take high quality, validated data that doesn’t traipse on privacy or data sovereignty,” Baer said.

Consequently, there’s a growing emphasis on retrieval-augmented generation (RAG) solutions, which form the bridge between standard databases and large language models, Baer said.
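As a rough sketch of the bridge Baer describes, the hypothetical RAG pipeline below retrieves the most relevant stored document and stuffs it into the prompt. The `embed` function here is a stand-in (simple character-frequency vectors); a real system would use an embedding model and a vector database.

```python
import math

# Hypothetical embedding function -- in practice this would call an
# embedding model; here we fake it with normalized letter-frequency vectors.
def embed(text: str) -> list[float]:
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already unit-length, so the dot product is the cosine.
    return sum(x * y for x, y in zip(a, b))

# The "database" side of the bridge: documents indexed by their embeddings.
documents = [
    "Q3 revenue grew 12% year over year.",
    "The refund policy allows returns within 30 days.",
    "Server maintenance is scheduled for Saturday night.",
]
index = [(doc, embed(doc)) for doc in documents]

def retrieve(query: str, k: int = 1) -> list[str]:
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [doc for doc, _ in ranked[:k]]

# The "LLM" side of the bridge: retrieved text goes into the prompt so the
# model answers from vetted data instead of free-associating.
query = "What is the refund policy?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
print(prompt)  # this prompt would then be sent to the language model
```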

Baer points to the latest announcements out of the AI Alliance, a consortium of leading technology companies, which emphasize the need to establish trustworthy data foundations.

“Data is the most important constituent of AI models and systems, yet today data for AI too often has murky provenance, unclear licensing, and large gaps in quality and diversity of languages, modalities, and expert domains represented,” according to a statement announcing the AI Alliance’s Open Trusted Data Initiative.

The goal of the initiative is to release “large-scale open, permissively licensed data sets with clear provenance and lineage across all domains and modalities essential for AI.” The initiative brings together more than 150 participants from more than 20 organizations including Pleias, BrightQuery, Common Crawl, ServiceNow, Hugging Face, IBM, Allen Institute for AI, Cornell, Aitomatic, Tokyo Electron, and EPF.

The initiative’s members are “working to develop better requirements, processes and tooling to curate data sets that are more transparent, trusted, accurate, and applicable broadly.”

Along with refining the specification for open trusted data, alliance members plan to build out tooling and publish pipelines for trusted data processing, including end-to-end lineage tracking capabilities. The alliance also intends to “significantly expand the data catalog aiming to include data for most of the world’s languages, large repositories of high-quality multi-modal data including images, audio and video, as well as time series and scientific modalities.”

As the world’s data becomes more precious, Thurai foresees less and less differentiation among leading large language models. As a result, enterprises will turn to narrower, more focused models that leverage data from specific sectors. Examples include industry-specific models such as BloombergGPT for finance, Med-PaLM2, developed by Google specifically for the healthcare industry, and the Paxton AI legal language model, trained on a large corpus of legal cases, statutes, and regulatory sources.

BloombergGPT is “a 50-billion parameter LLM that was specifically trained on a wide range of financial data,” Thurai said. “Because of that, it outperforms similarly sized open models on financial natural language processing tasks compared to other AI models.”

Med-PaLM2 “is trained on large amounts of medical datasets, including textbooks, research papers, patient records, and more,” said Thurai. “This intensive training has helped the model to acquire deep medical knowledge, allowing it to understand the complex language and concepts used in the healthcare field.”

The Paxton AI legal language model “provides real-time access to millions of legal sources, including laws, court rulings, and regulations, across all 50 U.S. states and federal jurisdictions,” said Thurai.

Alongside big data from various sources will come increased use of synthetic data, though Thurai advises caution with its adoption. “Synthetic data generation to train AI models has become a larger cottage industry now,” he said. “While a lot of them are used to fill data blind spots, at times it could defeat the purpose. By using AI to produce data, one might produce models that might barely encounter real-world problems and are trained on expected scenarios. These models can balk at unexpected real-world problems or the so-called unknown unknowns.”
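Thurai’s caution can be illustrated with a small, hypothetical example: an anomaly detector tuned entirely on synthetic “expected scenario” data, which then stays silent on a real-world problem the generator never modeled. All numbers below are invented.

```python
import random

random.seed(42)  # reproducible demo

# Synthetic "expected scenario" data: normal request latencies around 100 ms.
synthetic = [random.gauss(100, 10) for _ in range(10_000)]

# A naive detector tuned entirely on the synthetic data: anything beyond
# 3 standard deviations of the synthetic distribution counts as anomalous.
mean = sum(synthetic) / len(synthetic)
std = (sum((x - mean) ** 2 for x in synthetic) / len(synthetic)) ** 0.5
upper = mean + 3 * std  # roughly 130 ms

# Real traffic contains an "unknown unknown" the generator never produced:
# a slow memory leak nudging latencies to ~125 ms -- harmful, but still
# inside the synthetic envelope, so the detector stays silent.
real_anomalies = [random.gauss(125, 3) for _ in range(100)]
missed = sum(1 for x in real_anomalies if x <= upper)
print(f"threshold: {upper:.1f} ms; real anomalies missed: {missed} of {len(real_anomalies)}")
```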
