Building Anti-Hallucination Patterns

Over the past several weeks, I've been building out a safe and effective platform that helps everyone safely and effectively answer technical requirement questions in a tailor-built environment that abstracts a way the complexity of working with Agents.

In benchmarking exercises, my tool ended up being 47% cheaper and 51% faster than working with standard MCP tools. Importantly, it also entirely avoided the pitfalls of MCP tools leaking senstive internal information (like internal project names). Using mid-tier models with MCP tool access, I was finding up to 87% leak rate across a list of 134 example questions.

Achieving those two behaviours reliably took a lot of trial and error. It's 'safe' because it only has access to a library of cited public sources, and it's effective because it's ring-fenced with anti-hallucination patterns. Here's how it was done.

1 - A Compressed Reference Library is not a Limitation

It's been demonstrated several times over that models with large volumes of training data learn to 'always have an answer'. As we saw most recently in GLM-5.2, a significantly smaller model is less likely to be confidently incorrect than their frontier model counterparts.

In my project, I took the approach of stripping away tool access from any agents working within the platform, and the boundaries within which they operate is comprised of only about 300 .md files. Obviously, the models themselves have their own intrinsic training data (I'll talk about that in a sec), but as far as they understand, they're only allowed to reference that limited set of resources to find their answers.

Without access to the Web, or any internal tools, we get much more reliably effective answers from Agents. I can perform 10 independent runs across the same questions list, and while the written output may be different the underlying facts of the answer will be consistent - exactly what you want out of a high-trust system.

2 - A Models' Training Leaks Out

One of the ways in which my platform works to ensure the relative safety of it's answers is that each one must contain a publicly available url citation from which the reference content was sourced. Each answering agent works like an independent librarian, looking up relevant references in the top-level index files, finding the right path to the 'books' it needs to read to source it's answer, and then citing those sources as it passes back an answer.

What I've found, especially for the Anthropic Opus models, is that the models' intrinsic training data can contain information that was true at the time the model was being trained, but may not be true anymore. In some tests, Opus models were trying to pass off answers that contained URLs that may have been active in 2023, but now 404 when trying to visit them.

The only reliable way I've found to catch these was to implement a 'retry' script that double-checks the cited URLs within every answer that an agents tries to pass off against an 'allow-list' that I know for sure is valid. This static retry block routinely catches false-true URLs that an Agent tries to slip past the rules.

3 - Solving Inaccuracy in the Long-Term

Keeping an updated reference library within which the Agents are allowed to operate is only half of the solution; once you have a reliable method of automatic self-healing to keep sources up-to-date, you need to think about a way to validate that the self-healing was successful.

In my case, I deployed a visual explorer of the index to be able to ask targeted questions after intaking new information into the sources.

This can be used to ensure a topic is being accurately referenced by Agents using the sources. For example, after a major release's information was sourced and built into my platform's index, I could target specific questions about the new information that would tell me wether Agents were successfully finding the new facts or not. I've caught several missteps this way, where new information was either confusing to Agents or was never making it into answers as citiations.

Each new Index deployed to the platform gets a unique ID, and every Agent stream that's composing answers writes the Index version it was using to the stream file.

Basically, this means that if an Index version is found to be 'unhealthy' in some way (eg. it contains some incorrect or out of date information about a topic), I can deploy a new healed index and say to the system 'hey, check any runs that used Index [ID] for ____ topic and correct those answers as necessary'.

4 - Agents Get Confused by Contradictions

One of the trickiest user features to build into my platform was the 'Playbook' that teaches Agents how to answer every question identified in a document. Users expect to be able to add instructions like 'keep answers short' or 'make sure to include details about headless hosting' and have Agents keep those details in mind when sourcing answers from the reference content and writing any answers.

However, what happens when you allow for free-form instructions is that contradictions emerge between all the rulesets that the Agents are trying to follow.

You can't walk into a restaurant and say 'yeah give me the spaghetti and meatballs, but hold the noodles, hold the meat, and make sure it's a pizza'.

An intrinsic system prompt baked into the model being used might contain 'be accurate', while my platforms' boundaries contain 'be concise', and a user-set playbook could contain 'be enterprise-focused'. That's 3 set of instructions that can't all be true at the same time for every question. An agent might convince itself that to be accurate and enterprised focused requires two paragraphs, negating 'concise' by necessity.

Similar to the 'retry' allow list for canonical URLs, I eventually had to put in a strict ceiling for response lengths. Numbers are respected verbatim, while looser 'be concise' language is standardised to a 'short/medium/long' instruction to the answering agents. If they go over a ceiling, they'll be asked to re-write. In terms of contradicting instructions, a hierarchy was needed to teach the agents which instruction source to respect if any contradictions exist.

Overall, I had an absolute blast experimenting and troubleshooting my way to a reliable technical answering platform! Any questions, send 'em my way. ✌️