Alexandros’s Substack

Global RAG chatbot for internal compliance teams

Alexandros Zenonos — Tue, 24 Mar 2026 15:32:24 GMT

A compliance assistant is not “an LLM with documents”. It’s a controlled information system sitting inside a risk function.

If you treat it like a demo, it will behave like one, and the first serious user will break it.

What matters is not how fluent it sounds. What matters is whether it can be held accountable.

Start with boundaries, not prompts

Before building anything, you need to answer:

What is this allowed to do?
What is it explicitly not allowed to do?
What happens when it cannot answer safely?

A useful baseline:

it can quote and explain policy with citations
it can compare two internal rules if both are retrievable
it cannot give personal legal advice
it cannot “interpret intent” beyond the policy text

This is where most projects die, because nobody wants to be the person who says “no”.

The core requirement: an auditable trail

Compliance users do not want magic. They want traceability.

At minimum, the system needs:

retrieval trace: which sources were used (IDs + versions)
answer trace: which parts of the response map to which sources
system trace: which prompt and model were used, and which version of each
refusal trace: why it refused, and what the escalation path is

If you cannot show the trail, you are not building a compliance tool. You are building a liability.
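A minimal sketch of what one audit record could look like, with one field group per trace listed above. Every name here (SourceRef, AnswerSpan, AuditRecord, uncited_spans) is hypothetical and chosen for illustration. It is not the schema of any particular system.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class SourceRef:
    doc_id: str
    version: str


@dataclass(frozen=True)
class AnswerSpan:
    text: str
    sources: tuple[SourceRef, ...]  # sources backing this part of the answer


@dataclass(frozen=True)
class AuditRecord:
    question: str
    retrieved: tuple[SourceRef, ...]       # retrieval trace
    answer: tuple[AnswerSpan, ...]         # answer trace
    prompt_version: str                    # system trace
    model_version: str                     # system trace
    refusal_reason: Optional[str] = None   # refusal trace
    escalation_path: Optional[str] = None  # refusal trace

    def uncited_spans(self) -> list[str]:
        """Return answer spans with no source, or a source that was never retrieved."""
        retrieved = set(self.retrieved)
        return [
            span.text
            for span in self.answer
            if not span.sources or not set(span.sources) <= retrieved
        ]

    def to_json(self) -> str:
        """Serialise the whole record so it can be stored and reviewed later."""
        return json.dumps(asdict(self), sort_keys=True)
```

The useful check is uncited_spans: if it returns anything, part of the answer has no trail and should not reach the user.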

Failure modes to design for (because users will find them)

1) Judgement-shaped questions

Typical user prompts include:

“Can I do X?”
“Is this allowed?”
“Is this risky?”

You need explicit handling:

classify risk tier
answer only what the policy says
route judgement to humans
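As a rough sketch, the three steps above could be wired together like this. The keyword patterns are a stand-in assumption (a real deployment would more likely use a trained classifier), and RiskTier, classify, and respond are hypothetical names.

```python
import re
from enum import Enum


class RiskTier(Enum):
    LOOKUP = "lookup"        # "what does the policy say about X?"
    JUDGEMENT = "judgement"  # "can I / is this allowed / is this risky?"


# Hypothetical patterns for judgement-shaped questions; illustrative only.
_JUDGEMENT_PATTERNS = [
    r"\bcan i\b",
    r"\bshould i\b",
    r"\bis (this|it|that) (allowed|ok|okay|permitted|risky|acceptable)\b",
]


def classify(question: str) -> RiskTier:
    """Step 1: classify the risk tier of the question."""
    q = question.lower()
    if any(re.search(pattern, q) for pattern in _JUDGEMENT_PATTERNS):
        return RiskTier.JUDGEMENT
    return RiskTier.LOOKUP


def respond(question: str, policy_excerpt: str, citation: str) -> dict:
    """Steps 2 and 3: return only the policy text, and flag judgement for a human."""
    tier = classify(question)
    reply = {
        "tier": tier.value,
        "policy_text": policy_excerpt,
        "citation": citation,
        "escalate_to_human": tier is RiskTier.JUDGEMENT,
    }
    if tier is RiskTier.JUDGEMENT:
        reply["note"] = "This tool quotes policy. It does not approve decisions."
    return reply
```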

2) Policy conflict

Two documents say different things. The system must surface conflict and cite both, not “average them”.
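A small sketch of "surface, don't average". It assumes an upstream step has already reduced retrieved passages to (rule_key, value, source_id) claims, which is the hard part and is not shown. The function name and the claim shape are hypothetical.

```python
from collections import defaultdict


def find_conflicts(claims):
    """Group claims by rule and keep only rules where sources disagree.

    claims: iterable of (rule_key, value, source_id) tuples.
    Returns {rule_key: {value: [source_id, ...]}} for conflicting rules only.
    """
    by_rule = defaultdict(dict)
    for rule_key, value, source_id in claims:
        by_rule[rule_key].setdefault(value, []).append(source_id)
    return {rule: values for rule, values in by_rule.items() if len(values) > 1}
```

Every conflicting value keeps its own source list, so the answer can cite both documents side by side instead of picking one.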

3) Missing context

If retrieval is weak, answers become fiction. The correct output is refusal plus the next step.
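One way to make "refusal plus the next step" concrete is a gate in front of generation. This is a sketch under assumptions: retrieval scores are taken to be similarities in [0, 1], the 0.75 threshold is an arbitrary placeholder to be tuned on real data, and Hit and gate are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # assumed: similarity in [0, 1], higher is better


def gate(hits: list[Hit], min_score: float = 0.75, min_hits: int = 1) -> dict:
    """Decide whether retrieval is strong enough to answer at all."""
    strong = [hit for hit in hits if hit.score >= min_score]
    if len(strong) < min_hits:
        return {
            "action": "refuse",
            "reason": "No sufficiently relevant policy text was retrieved.",
            "next_step": "Escalate to the compliance team with the original question.",
        }
    return {"action": "answer", "sources": [hit.doc_id for hit in strong]}
```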

Decisions that mattered

Short answers, structured by default.
Citations are not a feature. They are the product.
Retrieval quality is the key metric, not “LLM accuracy”.
Change control: policy updates must be versioned and testable.

What broke (and what we changed)

Users tried to get the bot to “approve” decisions. We tightened refusal modes and added explicit escalation prompts.
Policy updates changed answers unexpectedly. We moved to strict versioning and regression checks.
Early outputs were too smooth. We forced “evidence-first” behaviour even if it looked less impressive.
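The regression checks can be as plain as a golden set of questions with the sources they are expected to cite, re-run on every policy update. A minimal sketch, assuming the retriever can be called as a function that returns a set of versioned document IDs. The names and the ID format are hypothetical.

```python
from typing import Callable

# One golden case: a question and the versioned sources it should retrieve.
GoldenCase = tuple[str, frozenset[str]]


def regression_report(
    retrieve: Callable[[str], set[str]],
    golden: list[GoldenCase],
) -> list[dict]:
    """Return one entry per golden question whose retrieved sources changed."""
    failures = []
    for question, expected in golden:
        got = set(retrieve(question))
        if got != expected:
            failures.append(
                {
                    "question": question,
                    "missing": sorted(expected - got),
                    "unexpected": sorted(got - expected),
                }
            )
    return failures
```

An empty report is the test gate. Anything else is a change that someone has to sign off before the update ships.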

Takeaways

If you want a compliance assistant, you are building governance plus engineering. The model is the easy part.

If your project is stuck at “PoC but nobody will sign off”, the fix is usually boundaries, auditability, and test gates, not a better prompt.

Months to minutes: the boring engineering behind genomics NLP

Alexandros Zenonos — Sun, 08 Mar 2026 13:51:44 GMT

Genomics is full of “AI potential” and short on operational reality.

The bottleneck is rarely the model. It’s the plumbing: unstructured reports, inconsistent formats, brittle hand-offs, and a review process that does not scale.

In one NHS-facing workflow, we took manual extraction that was effectively measured in weeks or months and pushed it down to minutes for first-pass structured output. Not by doing something clever. By doing the unglamorous work properly.

The problem was never “PDF extraction”

The actual problem was end-to-end reliability:

Reports arrive in different templates.
Key entities are missing or phrased differently.
Clinical users need traceability, not “best effort”.
The system has to fit into a hospital data boundary.

If you only solve text extraction, you still fail at adoption.

What we built (bounded version)

A pipeline with four hard constraints:

Deterministic ingestion (no silent failures).
Traceable extraction (what was found, where, and how).
Standardised output (so it can actually be used).
Review loop (so clinicians can correct and improve).

The NLP is a component. The product is the workflow.

Decisions that mattered

1) Standardisation beats bespoke “perfect extraction”

We treated structured output as the target, not raw text. Instead of just scraping “BRCA1 positive”, we had to map complex phrases like “Pathogenic variant identified in BRCA1 c.68_69delAG” into a standardised schema linking the gene, variant, and clinical significance.

That meant:

Mapping into a consistent schema.
Keeping document provenance per extracted field.
Supporting partial results instead of all-or-nothing.
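To make the three points above concrete, here is a toy sketch of a schema with per-field provenance and partial results. The regular expressions are deliberately narrow assumptions that only cover phrases shaped like the example above. The two-gene list, the class names, and extract are all hypothetical, and this is not the pipeline described in this post.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExtractedField:
    value: Optional[str]             # None means "not found", not "negative"
    doc_id: str                      # provenance: which document
    span: Optional[tuple[int, int]]  # provenance: character offsets in the text


@dataclass
class VariantRecord:
    gene: ExtractedField
    variant: ExtractedField
    significance: ExtractedField

    @property
    def complete(self) -> bool:
        fields = (self.gene, self.variant, self.significance)
        return all(f.value is not None for f in fields)


# Illustrative patterns only; real reports need far broader coverage.
_GENE = re.compile(r"\b(?:BRCA1|BRCA2)\b")
_VARIANT = re.compile(r"\bc\.\d+(?:_\d+)?(?:del|dup|ins)[ACGT]*")
_SIGNIFICANCE = re.compile(
    r"\b(?:likely pathogenic|pathogenic|likely benign|benign)\b", re.IGNORECASE
)


def _find(pattern: re.Pattern, text: str, doc_id: str) -> ExtractedField:
    match = pattern.search(text)
    if match is None:
        return ExtractedField(None, doc_id, None)
    return ExtractedField(match.group(0), doc_id, match.span())


def extract(text: str, doc_id: str) -> VariantRecord:
    """Fill whatever can be found; missing fields stay None instead of failing."""
    return VariantRecord(
        gene=_find(_GENE, text, doc_id),
        variant=_find(_VARIANT, text, doc_id),
        significance=_find(_SIGNIFICANCE, text, doc_id),
    )
```

A record with complete == False is still useful: it carries what was found and makes the gap visible for review.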

2) Build an audit trail early

If you cannot answer “why did the system say this?”, you will lose trust fast.

So every extracted element needed:

Source reference (document + location).
Confidence and extraction-method metadata.
Versioning for pipeline changes.

3) Make the review loop part of the design

Clinicians do not want another dashboard that’s “interesting”.

They want:

A queue they can work through.
Fast correction.
Clear uncertainty flags.
A feedback mechanism that improves the system.
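A sketch of the queue idea, assuming each extracted record carries a confidence score in [0, 1]. Least confident items come out first, and corrections are logged so they can feed back into the pipeline. ReviewItem and ReviewQueue are hypothetical names.

```python
import heapq
from dataclasses import dataclass, field
from typing import Optional


@dataclass(order=True)
class ReviewItem:
    confidence: float                      # the only field used for ordering
    record_id: str = field(compare=False)
    flags: list[str] = field(default_factory=list, compare=False)


class ReviewQueue:
    def __init__(self) -> None:
        self._heap: list[ReviewItem] = []
        self.corrections: list[dict] = []  # feedback log for later improvement

    def add(self, item: ReviewItem) -> None:
        heapq.heappush(self._heap, item)

    def next(self) -> Optional[ReviewItem]:
        """Pop the least confident item, or None when the queue is empty."""
        return heapq.heappop(self._heap) if self._heap else None

    def correct(self, item: ReviewItem, field_name: str, new_value: str) -> None:
        """Record a clinician correction against the item it applies to."""
        self.corrections.append(
            {"record_id": item.record_id, "field": field_name, "value": new_value}
        )
```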

4) Treat data boundaries as design constraints

Healthcare data is not a playground, especially under NHS data governance. You can’t just pass personal data to a public API.

So the default posture was:

Minimise what leaves the boundary by operating within secure, approved environments.
Store only what is necessary.
Document exactly who can access what, and why.

What broke (and what we changed)

Template drift: Upstream report formats changed. We moved to more robust pattern handling and added detection for when a template looks “new”.
Edge cases: Rare variants and unusual phrasing. We stopped trying to be perfect and focused on triage: surface uncertainty and push to review.
Overconfidence: Early outputs looked clean but hid ambiguity. We forced the system to show uncertainty explicitly.
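For the template-drift case, one simple way to notice that a template looks “new” is to fingerprint each report by its section headings and compare against known templates. This is a sketch under assumptions: headings are taken to be lines ending in a colon, and the 0.8 similarity threshold is a placeholder. It is one possible approach, not a description of the detection we used.

```python
def fingerprint(text: str) -> frozenset[str]:
    """Collect normalised heading lines (assumed: lines that end with a colon)."""
    return frozenset(
        line.strip().rstrip(":").lower()
        for line in text.splitlines()
        if line.strip().endswith(":")
    )


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Set overlap in [0, 1]; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def looks_new(text: str, known: list[frozenset[str]], threshold: float = 0.8) -> bool:
    """True when no known template is similar enough to this report."""
    fp = fingerprint(text)
    best = max((jaccard(fp, k) for k in known), default=0.0)
    return best < threshold
```

A report flagged here should go to review rather than through the normal extraction path.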

Takeaways

Start with the workflow and data boundary if you want “AI in genomics” to actually deploy.
Standardisation and auditability are your fastest wins, not a fancier model.
Clinician review is not a fallback. It’s a core feature of the system.

If you’re sitting on unstructured clinical reports and calling it a data science problem, you’re already late. It’s a delivery problem.

Have you hit similar walls deploying NLP or AI in healthcare? Let me know in the comments below.

Coming soon

Alexandros Zenonos — Sat, 31 Dec 2022 00:07:05 GMT

This is Alexandros’s Substack, a newsletter about Artificial Intelligence and more.
