Ambarish Majumdar is a Marketing Science Partner at Meta, where he applies his subject-matter expertise in marketing science to improve AI models.
As enterprises accelerate AI adoption, most conversations begin with model selection, infrastructure and deployment speed. Those decisions matter, but they are rarely what determines whether AI succeeds in production. The real difference between a system that looks impressive in a demo and one that creates lasting business value is human judgment.
That human layer comes from subject-matter experts (SMEs).
Large language models can generate fluent, fast responses, but fluency should not be mistaken for quality. A model can sound confident while giving the wrong recommendation, missing a compliance issue or creating customer friction. In enterprise environments, quality is rarely just about correctness—it is about trust, context and decision-making.
Anthropic captured this clearly in its writing on evaluations: Evals are not the final checkpoint before launch. They are the foundation of responsible AI development.
My experience working with AI models that power ad ranking and relevance for Bing at Microsoft underscores a critical truth: Automated systems, no matter how sophisticated, cannot fully replace human judgment.
Consider a client case I worked on. Our AI models served advertisements for a popular SUV when users searched for the SUV’s model name. Internal quality systems flagged these placements as high-fidelity matches; after all, the keyword and the ad were lexically aligned. However, a human review of the user sessions told a different story. The overwhelming search intent behind that term was geographic—users were looking for information about a city of the same name, not a vehicle. What the algorithm scored as a precise match was, in reality, an irrelevant ad experience that eroded user trust and wasted advertiser spend.
This example illustrates a broader principle: AI excels at pattern recognition and scale, but it often lacks the contextual reasoning needed to interpret intent. Because of this, human evaluation remains an essential layer in any system where relevance is the measure of success.
SMEs Define What Quality Looks Like
Every AI system is optimizing toward something. In business settings, that “something” cannot be defined by benchmarks alone. SMEs establish the target.
They answer practical questions such as: Is this recommendation trustworthy? Would a customer support agent actually use this answer? Does this output reflect policy, compliance and operational reality?
In healthcare, quality may mean safety and clarity. In finance, it may mean risk reduction and policy alignment. In customer support, it may mean whether the answer resolves the issue instead of simply sounding helpful. Without SME input, teams often optimize for what is easiest to measure instead of what matters most.
Stress Testing Is Not Evaluation
Many organizations assume stress testing is enough. It is not.
Stress testing pushes a model into difficult or extreme scenarios to expose failure points. It helps identify where the system breaks and which edge cases create risk. That is useful, but it does not define quality.
A model passing stress tests does not automatically mean it performs well in everyday business operations. Stress testing shows where failure happens. Evaluation defines what success should look like. The two are related, but they are not interchangeable.
Peer Review Is Not Evaluation Either
Peer review is also valuable, but it should not be confused with a true eval system.
A few reviewers looking at outputs and sharing opinions creates useful feedback, but it is often inconsistent and subjective. One reviewer may approve something another would reject. Strong evaluations require structure. That means clear scoring criteria, repeatable standards, consistent measurement across reviewers and alignment to business outcomes, not personal preference.
In the end, evaluation has to turn judgment into something measurable.
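One way to check whether reviewers are applying the same standard is to measure how often they agree beyond what chance would produce. The sketch below computes Cohen's kappa for two reviewers who each labeled the same outputs. The "pass" and "fail" labels and the sample ratings are hypothetical, chosen only for illustration. It is a minimal sketch, not a prescribed method.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two reviewers who labeled the same items.

    1.0 means perfect agreement; 0.0 means agreement no better than chance.
    """
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("Both reviewers must rate the same non-empty set of items.")

    n = len(ratings_a)
    # Share of items where the two reviewers gave the same label.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each reviewer's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Both reviewers used a single identical label for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of the same four model outputs by two reviewers.
reviewer_1 = ["pass", "pass", "fail", "fail"]
reviewer_2 = ["pass", "fail", "fail", "fail"]
print(cohens_kappa(reviewer_1, reviewer_2))  # 0.5
```

A low score is usually a sign that the scoring criteria need tightening before the ratings can be trusted as a measurement.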
Why Multiple SMEs Matter
One reviewer is rarely enough. Enterprise AI touches multiple functions, and each team sees quality differently.
For example, a legal reviewer may focus on compliance risk, while a product leader may focus on usability. A support leader may focus on customer experience, while an operations leader may focus on efficiency and reliability. If only one perspective is used, important failure modes get missed. Multiple SMEs create stronger signals because real business decisions are never one-dimensional. Trustworthy AI requires a broader lens.
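To make this concrete, here is a minimal sketch of how scores from several reviewers might be combined so that one perspective cannot be averaged away. The role names, the 1-to-5 scale and the `min_score` threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Review:
    role: str   # hypothetical roles, e.g. "legal", "product", "support"
    score: int  # assumed scale: 1 (unacceptable) to 5 (excellent)


def summarize_reviews(reviews, min_score=3):
    """Report the mean score and any roles that scored below min_score."""
    if not reviews:
        raise ValueError("At least one review is required.")

    mean_score = sum(r.score for r in reviews) / len(reviews)
    # Any single perspective below the threshold blocks approval.
    blocking_roles = sorted(r.role for r in reviews if r.score < min_score)
    return {
        "mean_score": mean_score,
        "blocking_roles": blocking_roles,
        "approved": not blocking_roles,
    }


# Hypothetical reviews of one model output.
reviews = [
    Review("legal", 2),
    Review("product", 5),
    Review("support", 5),
    Review("operations", 4),
]
print(summarize_reviews(reviews))
```

In this example the average is 4.0, yet the output is not approved because the legal reviewer scored it below the threshold. A single averaged number would have hidden that failure mode.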
Evals Help Teams Climb The Right Hill
Most AI improvement happens through small iterations—prompt changes, workflow adjustments, retrieval improvements and model updates. This process resembles climbing a hill. You take one step at a time and try to move toward better performance.
But those steps only work if your team knows which direction is actually uphill.
Without strong evals, teams often optimize for speed instead of reliability. As a result, regressions go unnoticed and surface-level improvements look like real progress. With strong SME-led evals, by contrast, improvements are tied to business outcomes, changes become easier to validate and progress becomes repeatable instead of accidental.
Evals provide direction, not just reporting.
Human Evals First, Automated Evals At Scale
The strongest AI teams start with human evaluation. SMEs define the standards, identify failure patterns and establish what quality means. Only after that foundation exists should automated evaluations take over at scale.
Auto evals can then be used for detecting model drift, catching regressions across releases, monitoring consistency across workflows and preserving trust as systems evolve. Automation does not replace SME judgment. It protects it.
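As an illustration of catching regressions across releases, the sketch below compares two releases against the same SME-defined test cases. It assumes each case has already been reduced to a pass or fail result. The case IDs, function names and `max_drop` threshold are hypothetical.

```python
def pass_rate(results):
    """Share of cases that passed. `results` maps case ID to True or False."""
    if not results:
        raise ValueError("No results to score.")
    return sum(results.values()) / len(results)


def find_regressions(baseline, candidate):
    """Case IDs that passed in the baseline release but fail in the candidate."""
    shared = baseline.keys() & candidate.keys()
    return sorted(cid for cid in shared if baseline[cid] and not candidate[cid])


def release_gate(baseline, candidate, max_drop=0.0):
    """Block a release if any case regressed or the pass rate fell too far."""
    regressions = find_regressions(baseline, candidate)
    drop = pass_rate(baseline) - pass_rate(candidate)
    return {
        "regressions": regressions,
        "pass_rate_drop": drop,
        "ok": not regressions and drop <= max_drop,
    }


# Hypothetical results on three SME-defined cases for two releases.
baseline = {"q1": True, "q2": True, "q3": False}
candidate = {"q1": True, "q2": False, "q3": True}
print(release_gate(baseline, candidate))
```

Here both releases pass two of three cases, so the overall pass rate is unchanged. The per-case comparison still flags `q2`, which is the kind of regression an aggregate number can hide.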
Conclusion
The companies that want to succeed with AI don’t need the largest models or the fastest deployment cycles. They need to know how to define quality and improve it with discipline.
Great AI systems need a human touch because trust is still built by people, not models.
Forbes Technology Council is an invitation-only community for world-class CIOs, CTOs and technology executives.


