Open sourcing AI guardrails - IBM's push to improve safety and reduce hallucinations

A number of large vendors, including NVIDIA, Meta, and Mistral, have recently introduced several high-quality Large Language Models (LLMs) that are increasingly getting closer to parity with early leaders like OpenAI and Anthropic. However, these offerings are only sort of open with “innovative” licensing models that diverge from traditional open source approaches. Also, like all LLMs, they can sometimes hallucinate or generate harmful content.

This week, IBM has released its Granite 3.0, a new family of models meant to address both limitations. First, they are all made available under the more permissive Apache 2.0 license, meaning enterprises are free to use and modify it for commercial use. Rounded Wall Corner Guards

Open sourcing AI guardrails - IBM's push to improve safety and reduce hallucinations

It also introduced its Granite Guardian family of models, which are special-purpose LLMs designed to doublecheck other LLMs for safety and hallucinations & relevance. Previously, enterprises needed special-purpose models for one or the other. For example, Meta’s LlamaGuard can improve safety. Meanwhile, separate models such as Google’s T511B, Nightdessert’s WeCheck, and Bespoke Labs’ MiniCheck help reduce hallucinations and enhance relevance.

Kush Varshney, Ph.D, IBM Fellow, IBM Research, explains:

Granite Guardian stands out due to its comprehensive harm detection, including unique capabilities for hallucination risks in RAG. Jailbreaking detection is built-in, unlike competitors that need additional models for this purpose. These are areas where others fall short or require supplementary tools.

Hallucinations and safety are just part of the problem that enterprises need to consider when deploying LLMs. So IBM integrates Granite Guardian with enterprise AI governance platforms like watsonx.governance and Unitxt. Varshney says:

This infrastructure offers advanced features such as end-to-end monitoring, compliance workflows, and risk assessments, setting it apart from open-source solutions that lack similar governance tools. IBM’s open-sourcing of red-teaming datasets like AttaQ and SocialStigmaQA further enhances the model’s versatility.

Granite Guardian is designed to be run adjacent to other LLMs and retrieval augmented generation techniques that prime an LLM on specific documents or text. The larger billion-parameter models are more competent but slower and more costly. The smaller million parameter models can provide better performance and reduced cost for specific tasks, like identifying hate speech.

The Granite Guardian models act as a gatekeeper to LLMs, prompts, and training data. The model itself generates simple yes/no outputs based on its specification and configuration for content scoring. This can facilitate risk assessment for a model, improve model alignment, support online guard railing, and monitoring & observability.

Varshney says that in guard railing use cases, enterprises can take several approaches, including:

Organizations may also use the simpler models to filter out hate, abuse, and profanity from pre-training data at scale in a training scenario.

In the AI community, safety is a broad topic ranging from existential risks to humanity and everyday harm to enterprises, users and other stakeholders. The former leads to interesting arguments from closed-AI vendors arguing why regulators must lock out more open innovators. The everyday harms are more relevant for enterprises looking to deploy existing LLMs.

For example, Granite Guardian models can provide harm detection covering social bias, implicit and explicit hate speech, toxicity, violence, sexual content, unethical behavior, and jailbreaking. Most LLMs support guardrails internally that may be represented in the weights of millions or billions of parameters.

It can also identify jailbreaking efforts, where users try to get models to do something an enterprise might not want. For example, it might allow users to negotiate unaproved discounts, glean data from other users, or ask about unethical behavior.

In a safety context, Granite Guardian evaluates prompts LLM outputs and flags questionable ones for either a redo, failure mode or maybe hand off to a human if required. This provides an additional layer of protection beyond the core model itself. Enterprises can also customize the model by incorporating additional risks, allowing flexibility for emerging threats.

IBM compared its performance, identifying various harms against eleven benchmark data sets on the GitHub release page . The top granite model compared favorably against LlamaGuard on six metrics.

However, unlike LlamaGuard, Guardian supports additional guardrails for detecting different kinds of hallucinations relating to context relevance, groundedness, and answer relevance.

IBM compared Granite across the TRUE benchmark , which considers accuracy across eleven data sets. On average, it achieved 85% accuracy in detecting hallucinations, which was higher than Google and WeCheck models and slightly behind MiniCheck.

There are many caveats in comparing the performance of guard rail models for safety and hallucination mitigation. For one, the scientific community tends to use slightly different metrics for comparing performance on a scale of zero to one, using an F1 score for safety and area under the curve (AUC) for hallucination detection. Varshney elaborates:

The two metrics are similar and tend to be correlated, although they do have nuanced differences. One of the main differences is that the F1 score quantifies predictive performance at a single operating point (decision threshold), while AUC averages predictive performance across all operating points. We chose to report the harm detection performance with an F1 score and the hallucination detection performance with AUC because that is what is commonly done in the previous work. In our technical report coming out soon, we will report all metrics for all of the benchmarking we’ve done.

Also, it is important to appreciate how a guard rails model enhances the existing safety and hallucination measures baked into the underlying LLM. This is not so straightforward since the guardian model may create false positives or negatives. That said, the guardian model can significantly improve the safety and accuracy of responses.

IBM’s new open source guard rails models are entering a growing field of hallucination mitigation tools and approaches Diginomica has previously covered from TruEra , Galileo , and Vectara . Varshney says:

“Granite Guardian, along with proprietary vendor solutions such as TruEra, Galileo, and Vectara, is expected to thrive within a complementary ecosystem. Companies frequently adopt hybrid strategies, utilizing open-source components for fundamental guardrails while relying on vendor solutions to address specialized needs like compliance, monitoring, and end-to-end RAG optimization. What sets Granite Guardian apart is its comprehensive capabilities, enabling organizations to implement a transparent, robust enterprise solution that supports customized risk definitions and allows for easy deployment.”

IBM is also working on overarching guardrail solutions like its enterprise-grade watsonx.governance platform. This allows enterprises to mix and match different hallucination detectors. It is also working on broadening the scope of hallucination evaluation. It recently open sourced the WikiContradict benchmark dataset. It is also exploring other hallucination mitigation techniques beyond detection such as an episodic memory-based architecture called Larimar .

As for future work, Varshney notes:

IBM is committed to continually improving AI safety by developing custom risk detection features, scaling content moderation, and advancing hallucination detection. These plans reflect IBM’s broader strategy to build AI solutions that are both adaptable and enterprise-grade, catering to the needs of organizations seeking resilient, safe AI models.

The first wave of gen AI focused on improving the capabilities of LLMs. Proprietary AI vendors have made many strides on this front. The next wave of innovation that will provide the most value to enterprises will require improving the trustworthiness of AI models that make use of business data and which can be enhanced as necessary.

This includes the openness of the AI models themselves and the supporting governance infrastructure. IBM has made commendable progress with its commitment to fully open source licenses. Although many other vendors have released so-called open source models, a careful look at the licenses shows they are not truly open when it comes to commercial use beyond a certain threshold.

Also, the new guard rail framework will likely inspire further innovation to improve the safety and trustworthiness of AI apps in production. Some enterprises may use these tools to code apps themselves using various internal and public development tools. However, IBM is also growing a connected ecosystem of tools as part of its cloud, which could make this easier.

diginomica and the diginomica logo are trademarks of diginomica Limited.

Open sourcing AI guardrails - IBM&#039;s push to improve safety and reduce hallucinations

Open sourcing AI guardrails - IBM's push to improve safety and reduce hallucinations