Vincent Danen is the Vice President of Product Security at Red Hat.
Artificial intelligence, particularly generative AI (GenAI) and, within it, large language models (LLMs), has taken the industry by storm. Companies around the world are evaluating and implementing these machine learning (ML) technologies to improve efficiency and ultimately lower costs. In August, Amazon announced that it had saved 4,500 years of programming time thanks to automatic code generation.
As with any powerful new technology, challenges are present. Most of the first widely available LLMs were offered as black-box services over the web, and concerns were immediately raised about the confidentiality and privacy of the data sent to them. Legal departments identified the risks, and many companies began publishing internal policies that prohibited the use of GenAI because of the risk of leaking confidential information. Samsung, for example, banned ChatGPT after what it considered a sensitive data leak.
These same concerns are not present in open models running locally. Open-source AI models shine because they allow companies to use AI capabilities without risking internal information leakage. Additionally, they make it easier, and in some cases feasible at all, for companies to comply with international regulations and laws governing personally identifiable information (PII).
The Benefits Of Open-Source AI
Open-source AI gives organizations different opportunities for deployment and operation. Open models can be run on an organization’s private cloud instance or within its data center, while many proprietary models are only available as a service. Running models on-premise helps reduce the risk of using GenAI technology, making them an excellent fit for industries ranging from banking to retail to telecommunications. In particular, on-premise models help reduce the risk of non-compliance with security and privacy regulations through enhanced data sovereignty, complete control over data handling practices, seamless integration with existing security infrastructure and, in general, reduced reliance on third-party assurances.
With advances in quantization, the costs of serving local models are falling fast, and fine-tuning with optimization techniques is becoming more practical. Approaches like retrieval-augmented generation (RAG) make it possible to connect an open model to a knowledge base, such as a database of frequently asked questions, with only a few lines of code. Thanks to open-source AI, the ability to query all of your documents in natural language while keeping your company’s information private is only a few hours of work away, and this is only the beginning.
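As a rough illustration of that pattern, the sketch below retrieves the most relevant entry from a tiny, made-up FAQ list and passes it to a locally served open model. It assumes an Ollama-style endpoint on localhost and a hypothetical model name, and it uses simple TF-IDF matching as a stand-in for a real embedding model and vector store; a production RAG setup would be more involved, but the retrieve-then-prompt flow is the same.

```python
# Minimal RAG sketch. Assumptions: a locally served open model exposed through
# an Ollama-style /api/generate endpoint, a hypothetical "llama3" model name,
# and a small in-memory FAQ list standing in for the company knowledge base.
import requests
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

faq = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Support is available Monday through Friday, 9am to 5pm CET.",
    "Enterprise customers can request a dedicated account manager.",
]
question = "How long do customers have to return a product?"

# 1. Retrieve: rank FAQ entries by textual similarity to the question.
vectors = TfidfVectorizer().fit_transform(faq + [question])
scores = cosine_similarity(vectors[-1], vectors[:-1]).flatten()
context = faq[scores.argmax()]

# 2. Augment and generate: send the retrieved context to the local open model.
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": prompt, "stream": False},
    timeout=60,
)
print(response.json()["response"])
```

Because both the documents and the model stay on local infrastructure, nothing in this flow leaves the company’s environment.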
When an open model includes information about the data used to train it, that transparency adds trust to the resulting model and process.
Important advantages of open-source AI include:
• Increased transparency and trust.
• Faster innovation through collaboration.
• Democratization of AI technology.
• Prevention of monopolization.
These advantages allow open models to progress faster than closed ones; innovation is one of the key benefits of open source. Given the IT industry’s interest and the knowledge-sharing that open-source AI enables, the field is expected to evolve quickly in the coming months and years.
The Importance Of Defining Open Models
Open models (LLMs in particular) have become so important that some companies now claim to be releasing them. However, simply making a model openly available is not enough to call it open source. Many of these companies retain privileges over the model that go directly against the definition and spirit of open source.
The Open Source Initiative (OSI) defines what open source is. For a model to truly be open source, its license must allow, among other things, free redistribution and the creation of derivative works, for example, by fine-tuning the model. If those permissions are not granted, the model is not open source.
This is important because relying on third-party models that are not open source can have negative consequences for a company. For example, it may restrict or block the company’s ability to develop AI solutions due to vendor lock-in.
Training Data Transparency
Alongside open models, open data sources are another key ingredient in the open-source AI recipe. Artificial intelligence models learn by processing large amounts of data; the more data, the better.
GPT-4 is said to have been trained on 10 trillion words, but which words? We cannot expect to know the training data for closed AI. At first glance, we might expect to have it for open models, but data is not the same as code. Say we want to train an LLM on medical data. While this is exactly the kind of problem open source could help investigate and implement, it’s very difficult to publish medical data due to privacy concerns.
Medical data is one example, and there are dozens of similar cases where it would be useful to have an LLM trained on data that faces insurmountable obstacles to being made open. In version 1.0 of the open-source AI definition published by the OSI, the requirement is to provide “sufficiently detailed information about the data used to train the system so that a skilled person can build a substantially equivalent system.” However, it does not require that the data itself be open or have an open-source license.
Not having the training data itself is different from lacking knowledge of its characteristics or assurance of its quality and provenance. An open-source model should fully disclose the data’s description, provenance, scope, acquisition, selection, labeling, processing and filtering, including a list of all public and non-public data sources.
What Comes Next
Open-source AI has the potential to do for AI what Linux did for operating systems. We now see open-source operating systems everywhere: in the cloud, on servers and on mobile devices. Likewise, we will see open-source AI in many companies of the future.
The quality of open models is on par with that of closed ones, and the benefits make open-source AI a powerful tool for companies where security, compliance and cost-efficiency are priorities. I believe organizations should not balk at the apparent complexity of AI or the potential risks. By implementing a phased AI adoption strategy aligned with their unique business needs, organizations can manage the risks and reduce them to acceptable levels. This risk-aware, phased approach can help companies obtain the maximum value from this emerging technology.