FAITHFUL [quote is real] + RELEVANT [on topic] + SUPPORTIVE [backs the claim]
Thesis: make verifiable citation the product. This deck follows its own rule: every claim and figure carries a visible source. The baseline (faithful + relevant) answers the case; the extension (supportiveness + intent agent) is the new solution.
RECAP · THE BRIEF
WHAT WE WERE ASKED
Take the research agent from prototype to a tool trusted every day
REMATIQ is a MedTech compliance platform in three layers: the compliance graph, workflows, and a general-purpose research agent. This case is only about the third layer, the one that handles long-tail Q&A and one-off documents.
From REMATIQ's site. The research agent is the top-layer Documentation Agents, reading and writing on the Ontology and Audit Graph. REMATIQ.COM
REMATIQ's real architecture. The research agent = top-layer Documentation Agents, reading/writing on the mid-layer Traceability Engine (Life Science Data Ontology + Audit Graph). Verifiable citation lands on the ontology's typed links and the audit graph, described in REMATIQ's own words, with no need to borrow Palantir's vocabulary.
OVERVIEW
THE ARGUMENT, IN ONE BREATH
A citation is a contract the backend must honor.
First confirm the industry baseline, then the solution proven in production, then name the frontier, then the solution and deliverables. Baseline and extension stay clearly separate.
ANSWERS THE CASE · BASELINE
Verifiable citation is the industry floor; faithfulness and relevance are largely solved
reduce-hallucination ships these two layers in production
EXTENSION
The real frontier is supportiveness, and it depends on the user's intent
Solved with an intent agent and “surface, not decide”
better-doc: lead with one thesis, then the route. Keep baseline (answering the case) and extension (new ideas) apart so focus doesn't blur.
CASE · BASELINE
THE FLOOR, ALREADY SOLVED
An answer without a source is nearly useless.
This is the floor in regulated industries. Medical and legal AI long ago made “answers with clickable sources” standard: each sentence anchors to the source, hover to preview, click to verify.
OpenEvidence
Clinical Q&A; cites peer-reviewed sources sentence by sentence; declines when unsupported OPENEVIDENCE
Harvey
Legal; cites the specific clause/paragraph, click back to verify HARVEY
So “showing citations” is just the entry ticket; the real moat is backend verification. It solves the first two pillars: faithfulness (the quote is real) and relevance (on topic).
Concede this is the baseline; don't sell the entry ticket as innovation. This page also introduces the first two of the three pillars.
CASE · BASELINE
PRODUCTION CANON
reduce-hallucination: take the first two pillars into production
It borrows proven techniques from interrogation science for getting a knowing witness to tell the truth PEACE · SUE: treat each LLM node as a witness, build five gates, and validate in production 2,818 TASKS. At this point, faithfulness and relevance are both solved.
1. Schema keeps an abstention exit: output null when unsure, never guess
2. The {value · source · verbatim quote} triple
3. At the exit, code-check the quote exists verbatim (zero LLM cost)
4. Label provenance: stated · inferred · absent
5. Verification must be real: asking for citations without checking teaches the model to fake more convincing ones
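The gates above can be sketched in a few lines. This is a minimal illustration, not the reduce-hallucination code itself: the names `Claim` and `verify_claim`, the source ID, and the source text are all hypothetical, and treating every failed check as an abstention is an assumption.

```python
from dataclasses import dataclass
from typing import Dict, Literal, Optional

Provenance = Literal["stated", "inferred", "absent"]  # gate 4 labels


@dataclass(frozen=True)
class Claim:
    """Gate 2: the {value, source, verbatim quote} triple (hypothetical shape)."""
    value: Optional[str]       # gate 1: None is the abstention exit
    source_id: Optional[str]
    quote: Optional[str]
    provenance: Provenance


ABSTAIN = Claim(value=None, source_id=None, quote=None, provenance="absent")


def verify_claim(claim: Claim, sources: Dict[str, str]) -> Claim:
    """Gate 3: deterministic exit check with no LLM call.

    Returns the claim unchanged only if its quote appears verbatim in the
    cited source. Anything else becomes an abstention (assumed policy).
    """
    if claim.value is None or not claim.source_id or not claim.quote:
        return ABSTAIN
    text = sources.get(claim.source_id)
    if text is None or claim.quote not in text:
        return ABSTAIN  # gate 5: the check is real, so a faked quote is dropped
    return claim


# Illustrative placeholder text, not the wording of any real standard.
sources = {"ISO-14971-s8": "The residual risk shall be evaluated by the manufacturer."}
grounded = Claim("evaluated", "ISO-14971-s8", "residual risk shall be evaluated", "stated")
invented = Claim("ignored", "ISO-14971-s8", "residual risk may be ignored", "stated")
```

`verify_claim(grounded, sources)` passes the claim through, while `verify_claim(invented, sources)` returns the abstention, which is the behavior gate 5 depends on.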
“Ask-and-actually-check” effect g≈0.80; in the same prompt, fields with an abstention exit had zero corrections. VERIFIABILITY STUDY | PRODUCTION A/B | OPEN SOURCE · GITHUB
This is the unfair advantage vs other candidates: not theorizing about citation, but having shipped this mechanism in production (speak in first person). Make clear: borrow interrogation science → five gates → production validation → faithfulness+relevance solved.
EXTENSION
THE FRONTIER · THIRD PILLAR
Supportiveness: does the cited passage actually support the claim?
Real and relevant does not mean supportive. A citation that resolves but does not support the claim (misgrounding) is worse than none, because it manufactures false trust. Support vs contradiction depends on which direction the user argues.
Supportiveness = Stance × Intent
94%
link valid / relevant · looks fine
39–77%
actually supports the claim · fails in substance
SOURCES: CITED BUT NOT VERIFIED · 2026 | STANFORD REGLAB · MISGROUNDING | SCITE.AI | ALCE · partial-support limit
This is the extension, so it is badged EXTENSION. The sources are on the table: the “verifiable” idea applied to my own answer. 94% vs 39–77% is the key figure.
EXTENSION
TWO AGENTS · PROACTIVENESS
Beyond answering, a second agent that infers intent
EXECUTION [answer] · INTENT [reads rings]
The execution agent grounds the answer; the intent agent starts from the current question and pulls in context ring by ring to infer what the user is really arguing.
① Question · what's being asked
② Session · answers / draft / revisions
③ History · the user's past choices
④ Team & product · how others write / existing content
⑤ Scene · the wider context
Combine the layers to infer intent, then decide which citation to use and resolve supportiveness in both directions. When intent is unclear, show both and let the human choose. The design follows KeyCite: direction is a review flag, not a verdict. WESTLAW KEYCITE
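A minimal sketch of “surface, not decide”, assuming the stance of each passage and the inferred intent are already available as labels. The names `Evidence`, `Surface`, and `surface`, and the three intent labels, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import List, Literal

Stance = Literal["supports", "contradicts"]
Intent = Literal["prove", "find_gap", "unclear"]  # assumed label set


@dataclass(frozen=True)
class Evidence:
    anchor: str   # where the passage lives
    stance: Stance


@dataclass(frozen=True)
class Surface:
    primary: List[Evidence]   # shown first
    flagged: List[Evidence]   # shown as review flags, never hidden
    needs_user_choice: bool


def surface(evidence: List[Evidence], intent: Intent) -> Surface:
    """Order the evidence by intent; never drop the opposing direction."""
    supports = [e for e in evidence if e.stance == "supports"]
    contradicts = [e for e in evidence if e.stance == "contradicts"]
    if intent == "prove":
        return Surface(supports, contradicts, needs_user_choice=False)
    if intent == "find_gap":
        return Surface(contradicts, supports, needs_user_choice=False)
    # Intent unclear: show both directions and let the human choose.
    return Surface(supports + contradicts, [], needs_user_choice=True)
```

The point of the design is that the opposing direction always stays visible as a flag. Intent changes the ordering, not what the user is allowed to see.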
Two agents: execution (answer) and intent (reads rings). The intent agent's input is context expanding outward: question→session→history→team/product→scene. Combine to infer intent, decide which citation, and how to resolve supportiveness.
EXTENSION · A WORKED EXAMPLE
WORKED EXAMPLE · no UI, just the process
Draft · BP Monitor risk management file
“SOP-042 meets ISO 14971’s residual-risk requirements.”
① Faithful + relevant · reduce-hallucination
Retrieve ISO 14971 §8 and SOP-042 §6; on topic. Five gates:
✓ the triple (value · source · verbatim quote)
✓ verbatim check: ‘residual risk shall be evaluated’ does exist in §8
✓ provenance = stated
✓ real check passed
→ faithfulness + relevance solved
② Bring in context · intent agent
Read session + draft: this sits in the ‘gap assessment’ section, in a self-assessing tone.
Pull in history and scene → Intent unclear: prove compliance, or find the gap?
→ unclear, don't decide for the user
③ Supportiveness · two citations, user chooses
▲ Supports: ISO 14971 §6 ‘risk control’ → backs ‘meets’
▼ Contradicts: §8 requires post-market residual-risk monitoring; SOP-042 §6 has no such step → exposes a gap
→ whichever is picked anchors the deliverable and feeds back as an intent signal
Same draft sentence: real and relevant; but which direction it supports depends on intent. Finding the opposing evidence is gap analysis. (Clause numbers are illustrative; the real UDM paragraph governs.) ISO 14971 · example | SOP-042 · example
End-to-end example: first reduce-hallucination's five gates solve faithful+relevant (each checked), then the intent agent brings in more context, then two opposing citations for the user to pick. No UI, just the process. Clause numbers are illustrative, a reminder not to fabricate.
DELIVERABLE · ONE
REPRIORITIZED USER STORIES
Citation is the spine, yet the PRD filed it under NICE; promote it to P0
P0 · spine: clickable citations · jump to section/paragraph · verbatim-quote check · abstention verdict (not in library / out of scope)
P1: progressive-disclosure answers (short claim + chip + expand, fixes “too long”) · deliverable as its own doc, PDF export
Deferred: image support · DOCX · multi-doc chat · full save-to-library (but citations carry a version stamp from now on)
Basis: both citation stories are NICE and unbuilt in the PRD, while positioning and strategy treat “verifiable” as core. The “too long” feedback is structural; progressive disclosure fixes it. PRD | PILOT FEEDBACK
One of the deliverables answering their question. Reprioritization: promote the misfiled-as-NICE spine to P0, and use progressive disclosure to solve the #1 complaint.
DELIVERABLE · TWO / A
ALIGN WITH STEFAN · CUSTOMER NEED
Align with Stefan (1): validate the customer-need assumptions first
Alignment = come with a judgment to confirm or refute, not open-ended questions. Frame each question as “which product decision does it settle for me”.
Need · length: “Answers too long”: is the real need less information, or one-click verify then expand? Validates: progressive disclosure vs blunt truncation · evidence: revisit Marcel / Paul's langfuse traces
Need · cost: How much costlier is a confident wrong answer vs an honest abstention for an RA? Validates: how aggressive abstention should be · a wrong conclusion in a submission costs far more than “not in the library”
Need · two-way: Showing both supporting and contradicting evidence: does it feel powerful, or like the tool is unsure? Validates: whether the supportiveness feature is worth building, and how to present it
Need · priority: Which area do pilot customers actually push on (regulatory Q&A / generation / gap)? Validates: whether my area-priority order is right
I bring not a question list but a judgment plus a set of assumptions for Stefan to confirm or refute.
Customer-need side. Translate vague feedback like “too long” into concrete product decisions Stefan can confirm/refute. This page makes clear what I align on, why, and with what evidence.
DELIVERABLE · TWO / B
ALIGN WITH STEFAN · ML & DATA
Align with Stefan (2): can the data and ML support the citation contract?
ML feasibility
Retrieval granularity: can we reliably retrieve at UDM paragraph / span level?
Verbatim verification: deterministic string-match against UDM; at what point does OCR normalization need a bounded fuzzy matcher? (the cost fork)
Stance classifier: do we have / can we build claim-evidence entailment? Is it accurate on conditional regulatory language?
Intent agent: ML (embedding / clustering over the session) or just a prompt?
How is the abstention threshold calibrated? What triggers it?
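The verbatim-verification fork can be made concrete with a small sketch: normalize both sides, try an exact match, and fall back to a bounded fuzzy match only if that fails. Everything here is an assumption for discussion with Stefan: the function names, the normalization steps, and the 0.9 threshold are placeholders, and the sliding window is meant for a single anchored paragraph, not a whole document.

```python
import re
import unicodedata
from difflib import SequenceMatcher
from typing import Tuple

# Map curly quotes to straight ones; NFKC alone does not do this.
_QUOTES = str.maketrans({"\u2018": "'", "\u2019": "'", "\u201c": '"', "\u201d": '"'})


def normalize(text: str) -> str:
    """Fold ligatures and odd spaces (NFKC), unify quotes, collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).translate(_QUOTES)
    return re.sub(r"\s+", " ", text).strip()


def match_quote(quote: str, paragraph: str, min_ratio: float = 0.9) -> Tuple[str, float]:
    """Return ('exact' | 'fuzzy' | 'none', best similarity ratio)."""
    q, s = normalize(quote), normalize(paragraph)
    if not q:
        return ("none", 0.0)
    if q in s:
        return ("exact", 1.0)  # deterministic path, no fuzzy matching needed
    best = 0.0
    window = len(q)
    # Slide a quote-sized window over the paragraph and keep the best ratio.
    for start in range(0, max(1, len(s) - window + 1)):
        candidate = s[start:start + window]
        ratio = SequenceMatcher(None, q, candidate, autojunk=False).ratio()
        best = max(best, ratio)
    return ("fuzzy", best) if best >= min_ratio else ("none", best)
```

Returning the match kind alongside the ratio keeps the cost fork visible: an “exact” result stays fully deterministic, while a “fuzzy” result can be logged and reviewed while the threshold is being calibrated.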
Data structures
Does each UDM paragraph have a stable, resolvable, version-stamped anchor?
Are typed links queryable at answer time? (so ‘inferred’ shows the real relation chain, not a vector guess)
Is revision / draft history recorded and accessible? (the intent agent's lifeblood)
Is org / project scope enforced at the data layer?
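To make the anchor question concrete, here is a minimal sketch of a version-stamped anchor and how an old citation could be resolved after its source changes. The `Anchor` fields, the in-memory `library` and `latest` mappings, and the three status labels are all assumptions; the real UDM storage may look quite different.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Anchor:
    """A citation target pinned to one version of one paragraph (hypothetical)."""
    doc_id: str
    paragraph_id: str
    version: str


# (doc_id, paragraph_id) -> {version: paragraph text}
Library = Dict[Tuple[str, str], Dict[str, str]]


def resolve(anchor: Anchor, library: Library, latest: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return (status, text), where status is 'current', 'stale', or 'missing'.

    A stale citation still resolves to the text it originally quoted; it is
    only flagged, so the audit trail stays intact.
    """
    versions = library.get((anchor.doc_id, anchor.paragraph_id), {})
    text = versions.get(anchor.version)
    if text is None:
        return ("missing", None)
    if latest.get(anchor.doc_id) != anchor.version:
        return ("stale", text)
    return ("current", text)
```

What the product should do with a “stale” result (re-verify, warn, or block sign-off) is left open here on purpose.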
In one line: if these hold, the spine ships in v1; if not, fix the data first, don't build flourishes.
ML + data side. The crux: can data and ML support our “verifiable + supportive + intent-aware” citation contract. OCR normalization is the one cost fork to confirm.
DELIVERABLE · THREE / A
ALIGN WITH ANTON · COST
Align with Anton (1): get the real cost per item; open with the cost asymmetry
Cheap: the triple · provenance badge · abstention verdict (schema / prompt only). “The model already retrieves the spans; this is just an output-format constraint, no new infra”
Cheap–med: verbatim string-match against UDM by anchor; deterministic, zero LLM. “Medium” only if OCR text normalization needs a bounded fuzzy matcher, the one estimate to nail down with Anton
Med · +1 LLM: stance / cross-examiner node only on sign-off, high-stakes answers, not every lookup
Pricey · uncertain: the intent agent; needs Anton to scope it (async over session logs? on the existing background-execution layer? latency / cost / data dependencies?)
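The cost asymmetry in the first three rows can be written down as a tiny routing sketch: the zero-LLM checks run on every answer, and the stance node is added only for high-stakes ones. The names and the `high_stakes` flag are hypothetical, and the call counts simply restate the rows above rather than measured costs.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Check:
    name: str
    extra_llm_calls: int  # additional LLM calls beyond the answer itself


TRIPLE = Check("triple + provenance + abstention", 0)   # schema / prompt only
VERBATIM = Check("verbatim string-match", 0)            # deterministic
STANCE = Check("stance / cross-examiner", 1)            # one more LLM call


def plan_checks(high_stakes: bool) -> List[Check]:
    """Always run the cheap spine; add the stance node only when it matters."""
    plan = [TRIPLE, VERBATIM]
    if high_stakes:
        plan.append(STANCE)
    return plan


def extra_llm_calls(plan: List[Check]) -> int:
    return sum(check.extra_llm_calls for check in plan)
```

An ordinary lookup adds no extra LLM calls under this plan, and a sign-off answer adds one. The intent agent is left out because its cost is the part that still needs Anton's scoping.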
The argument to Anton: most of the spine is cheap; the pricey half (the verifier) is exactly the line between a demo and a trusted tool. Skip it and the product gets more dangerous, not just less impressive.
Engineering-cost side, item by item. Persuade via the cost asymmetry: win the spine cheaply, frame the pricey parts honestly. OCR normalization is the one estimate to nail.
DELIVERABLE · THREE / B
ALIGN WITH ANTON · ITERATION & RISK
Align with Anton (2): sequence v1 / v2 / v3 by cost, and lay out the risks
Iteration path (by cost)
v1 · cheap: faithful + relevant + both-directions seed. Verbatim match + abstention + both directions (logged, not yet fed back)
v2 · med: distill a fine-tuned NLI model (much lower cost / latency) + intent agent + closed learning loop (guards the contradiction class against confirmation bias)
v3 · pricey: self-check + graduated autonomy + GxP validation / audit trail (mostly governance, not model work)
Engineering risks I raise
determinism on messy OCR docs (→ fuzzy-match threshold)
the intent agent's data dependency (are revisions logged)
proactivity × access-control intersection
version correctness: when a source changes, what happens to old citations
One-line cost story: v1 is a prompt and a schema; v2 is one cheap distilled model plus a pricey intent agent; v3 is mostly governance.
Iteration + risk. Let Anton sequence by cost, and proactively surface the engineering risks I've considered, so it isn't hand-waving. The one-line cost story is easy to remember.
PROTOTYPE
A CURSOR FOR COMPLIANCE DOCS · CLICKABLE DEMO
A Cursor for compliance docs: agents on the left, document on the right, verification in the middle
L AGENTS · SESSIONS
Multiple compliance tasks in parallel (like Cursor sessions) · bottom-left always-on intent agent: shows what it read, what was added to input, the current intent
M RUN LOG + CHAT
Grounded trace: read → verbatim verify ✓ → judge stance → both directions · evidence folded into an auditable log · click a citation to peek the source
R DOCUMENT EDITOR
The generated compliance doc, editable, with version history · revisions feed the intent agent · attest = attestation
No backend, one scripted case (BP Monitor / SOP-042 / ISO 14971). Verification is a real JS verbatim string-match. ▶ LIVE DEMO · /en/demo | highlight · #go
A Cursor / Claude Desktop for compliance docs. Left agent sessions + always-on intent; center run log (verification folded into the audit trail); right versioned document editor. One case shows it all. Use /en/demo live; #go jumps to the highlight frame.
FEATURE WALKTHROUGH
6 capabilities · 6 requirements (captured from the live demo)
Grounded run · verbatim verify ✓ · both directions, you choose | Req: faithfulness + relevance + supportiveness
Click citation → source span highlighted & verified / open full doc in split | Req: click back to verify (regulation down to paragraph)
Paper-style editor · version history · semantic buttons | Req: editable deliverable + lifecycle
Version diff (green add / red delete) | Req: revisions traceable · fed back to the intent agent
Send to Workflow · spawns a structured session | Req: three-layer linkage (research agent = platform on-ramp)
Feature walkthrough: 6 shots from the live demo, each ‘feature → requirement’. A static, readable capability overview for the panel, no need to drive the complex demo live. See /en/demo or the tour for detail.
UI RATIONALE
WHY THIS LAYOUT · ON THE AGENT-IDE PARADIGM
Why this design: borrow the Cursor / Claude paradigm, add a compliance-only layer
Borrowed · agent-IDE paradigm
Three columns: left agent sessions / center conversation+run / right artifact editor CURSOR · CLAUDE DESKTOP
Parallel sessions + model picker + @-mention docs CURSOR · CLAUDE
Cite to a specific paragraph, click to verify HARVEY
Atomic claims, each with provenance HEBBIA
Per-block accept + version / diff GEMINI IN DOCS · CURSOR
The compliance-only layer we add
Every claim is verbatim-verified on the backend; abstain if ungrounded (Cursor verifies code, not facts)
Always-on intent agent: keeps inferring what you're arguing, proactively offers both directions (Cursor / Claude don't)
Everything auditable: run log + version history + attestation → Audit Graph
Cursor / Claude Desktop proved the agent-IDE interaction works; bring it to compliance and add “verifiable + intent-aware + auditable”. That is a Cursor for compliance docs.
Answers “why this design”: on the Cursor / Claude agent-IDE paradigm (left agent / center chat / right artifact), add three compliance-only things: backend verification, an always-on intent agent, end-to-end auditability. No longer NotebookLM-led.
THE ONE TAKEAWAY
A verifiable source is the foundation of trust.
The baseline already ships in production; the frontier, supportiveness, is solved with the intent agent. Make the near field solid, advance the rest by v1 → v2 → v3.