QUELM
Reliability Intelligence for LLM-Powered Systems
Technical White Paper · March 2026 · Version 1.0
Georges Lieben · Tiemen Schotsaert
hello@quelm.ai · quelm.ai
In review — not for distribution
Abstract
Large Language Model APIs have become critical infrastructure for a growing class of production applications. Yet the engineering discipline required to operate them reliably — regression testing, output validation, drift detection — remains largely unaddressed by existing tooling. Observability platforms record what happened. Format validators confirm structural compliance. Neither provides assurance that outputs are semantically correct, internally consistent, or behaviourally stable over time.
This paper presents Quelm, a three-layer reliability platform designed to address LLM output instability at the points where existing tools fail. The three layers — proactive regression testing, live traffic monitoring with automated test synthesis, and real-time output certification — operate independently or in combination. Together they constitute a complete reliability stack for production LLM systems.
The output certification mechanism (Layer 3) is the principal technical contribution. It draws on the verification model used in structured financial payment references to derive an integrity signal that is inseparable from the output itself, and independently recomputable without reference to a historical baseline. We distinguish this approach from prior LLM self-verification literature, which has established that models cannot reliably verify their own outputs using internal signals alone. Quelm's certification is explicitly external and deterministic, addressing the correlated hallucination problem that prior approaches cannot.
1. Introduction
The deployment of Large Language Model APIs in production software has introduced a category of reliability failure with no precedent in conventional software engineering. Traditional APIs are deterministic: a given input produces a predictable output, and deviations are surfaced as exceptions. LLM APIs are probabilistic: outputs are semantically plausible by construction, failures are silent, and the system that generates a wrong answer is indistinguishable — at the API boundary — from one that generates a correct one.
Three distinct mechanisms drive this instability in practice.
1.1 Provider-side model updates
LLM providers operate continuous update cycles across hosted model endpoints. Safety tuning adjustments, inference parameter changes, and system-level modifications are applied without incrementing public version identifiers and without developer notification. The Stanford/Berkeley longitudinal study (Chen, Zaharia, and Zou, 2023) documented this empirically: GPT-4's accuracy on a standardised prime number identification task declined from 84% to 51% over a three-month period with no version change. Directly executable code generation fell from 52% to 10% over the same window.
This phenomenon is not theoretical. In April 2025, a silent update to OpenAI's GPT-4o introduced extreme sycophantic behaviour within 48 hours of deployment. In August and September 2025, three Anthropic infrastructure bugs degraded Claude's output quality across approximately 30% of production calls for several weeks before public acknowledgement. In May 2025, Google's Gemini 2.5 Pro preview endpoint was silently redirected to a different model build without changelog entry or developer notice.
1.2 Prompt fragility
Production LLM prompts are typically developed and tested against a representative but finite set of inputs. The distribution of inputs encountered in live operation diverges from the development distribution in ways that are difficult to anticipate. A prompt engineered to extract structured contract data from standard commercial agreements will encounter, in production, non-standard jurisdictions, unusual clause orderings, scanned document artefacts, and multilingual content. Fragility — the rate at which output quality degrades outside the development distribution — is not currently measurable by any existing tool prior to production deployment.
1.3 The absence of a reliability infrastructure
Conventional software engineering has developed extensive tooling for catching regressions: unit tests, integration tests, continuous integration pipelines, and production monitoring with anomaly detection. These instruments assume deterministic, exception-raising systems. None translate directly to LLM outputs, which are probabilistic, semantically evaluated, and context-dependent. The result is that production LLM systems operate without the safety net that the rest of the software stack takes for granted.
Quelm is designed to close this gap. The following sections describe the problem model, the three-layer architecture, the technical foundations of each layer, the competitive landscape, and the research basis for the design choices made.
2. Problem model
We define three distinct reliability properties that a production LLM system should maintain, and which existing tooling does not currently guarantee.
2.1 Behavioural stability
A system is behaviourally stable if, for a given prompt and representative input distribution, output quality does not degrade over time. Behavioural stability can be violated by provider-side model updates, by changes to the system prompt, or by distributional shift in user inputs. Detecting violations requires either scheduled re-evaluation against a fixed reference set, or continuous monitoring of live output quality against an established baseline.
2.2 Semantic correctness
An output is semantically correct if its content — not merely its structure — is accurate with respect to the task. This property cannot be verified by format validators, schema checkers, or regular expression matching. An invoice extraction that returns well-formed JSON with an incorrect total passes every structural check while failing on the property that matters. Semantic correctness verification requires either comparison against a known-correct reference, or an external computational check against the output's own declared logical relationships.
2.3 Internal consistency
An output is internally consistent if the logical relationships between its fields are valid and non-contradictory. A contract summary in which the stated total value does not equal the product of monthly fee and contract duration is internally inconsistent, regardless of whether either figure is individually plausible. Internal consistency is the weakest of the three properties — it is necessary but not sufficient for semantic correctness — but it is the only one that can be verified deterministically without external ground truth.
Quelm's three layers address these properties in order of verification cost: Layer 1 (regression testing) addresses behavioural stability through scheduled re-evaluation; Layer 2 (live monitoring) addresses behavioural stability continuously from production traffic; Layer 3 (output certification) addresses internal consistency deterministically at the moment of generation, with a structural approach that provides partial evidence of semantic correctness.
3. Architecture overview
Quelm is deployed as a lightweight SDK agent that runs within the customer's infrastructure. No LLM traffic transits Quelm's servers. Customers provide their own API keys directly to providers (Bring Your Own Key). The agent intercepts calls locally, computes quality signals, synchronises anonymised metadata to Quelm's central intelligence layer, and triggers alerts where thresholds are crossed.
This architecture was chosen deliberately over the reverse-proxy model used by several existing observability tools. The proxy model introduces data residency obligations under GDPR and HIPAA, creates a third-party dependency in the critical path, and is structurally vulnerable to provider terms-of-service changes that restrict traffic intermediation. The SDK/sidecar model eliminates all three risks.
| Layer | Name | Failure mode | Method | Baseline |
|---|---|---|---|---|
| 1 | Regression testing | Behavioural drift | Scheduled re-evaluation | Curated by team |
| 2 | Live monitoring | Prompt fragility | Continuous + auto-synthesis | Grows automatically |
| 3 | Output certification | Internal inconsistency | Deterministic recomputation | None |
The three layers are designed to be deployed independently or in combination. Layer 1 provides immediate value for any team with an existing prompt library. Layer 2 provides continuous coverage that compounds over time as the test suite grows. Layer 3 is the novel contribution and is applicable from day one for any structured output use case, with no setup requirement beyond defining the output schema.
4. Layer 1 — Proactive regression testing
4.1 Design
Layer 1 operates in shadow mode, entirely independently of production traffic. The team assembles a golden set: a curated collection of prompt-input-output triples that represent approved behaviour across the system's critical paths. Quelm re-runs these prompts on a configurable schedule — nightly, on every CI/CD deployment event, or on detection of a provider model version change — and compares each new output against the approved baseline.
Comparison is performed across four signal types, each sensitive to a different failure mode:
- Semantic similarity. Embedding-based cosine similarity between the new output and the golden output. Alerts trigger below a configurable threshold (default: 0.85).
- Structural diff. For JSON, YAML, or Markdown-structured outputs, field-by-field comparison of schema compliance and value presence.
- Assertion checks. User-defined boolean rules evaluated against the output: output must contain a specified string, response length must fall within a specified range, a specified field must be non-null.
- LLM-as-judge scoring. A secondary model evaluates whether the new output fulfils the same intent as the golden output. Returns a score on a 1-5 quality scale with supporting reasoning.
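Two of the signal types above — embedding-based semantic similarity and user-defined assertion checks — reduce to small deterministic functions. The sketch below illustrates how they might combine into an alert decision; `regressionAlert` and `AssertionRule` are hypothetical names, and only the 0.85 default threshold comes from the text.

```typescript
// Illustrative sketch of two Layer 1 comparison signals. The names are
// hypothetical; only the 0.85 default threshold is taken from the text.

type AssertionRule = (output: string) => boolean;

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// An alert fires when semantic similarity drops below the configured
// threshold, or when any user-defined assertion fails on the new output.
function regressionAlert(
  newEmbedding: number[],
  goldenEmbedding: number[],
  newOutput: string,
  rules: AssertionRule[],
  threshold = 0.85,
): boolean {
  const similar = cosineSimilarity(newEmbedding, goldenEmbedding) >= threshold;
  const assertionsPass = rules.every((rule) => rule(newOutput));
  return !(similar && assertionsPass);
}
```

In practice the embeddings would come from an embedding model call; here they are passed in directly to keep the decision logic visible.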
4.2 Limitations and mitigations
The fundamental limitation of golden-set regression testing is coverage: the test suite can only detect regressions on inputs that have been explicitly anticipated and curated. The long tail of production inputs is not covered. Quelm addresses this limitation through Layer 2, which synthesises new test cases from live traffic automatically.
A secondary limitation is golden set staleness. As the underlying task evolves, the golden set must be maintained. Quelm provides tooling to flag golden outputs that have been superseded and to surface candidate replacements from recent high-quality production outputs.
5. Layer 2 — Live traffic monitoring and automated test synthesis
5.1 Design
Layer 2 operates as a continuous observer on production LLM calls. The SDK intercepts each API request and response, evaluates the output asynchronously against the current baseline, and logs quality signals to the local agent. No synchronous overhead is added to the production request path.
The distinguishing capability of Layer 2 — not offered by any comparable product — is automated test synthesis. The agent applies statistical change-point detection to the stream of quality signals, identifying outputs that fall outside the expected distribution for their prompt class. Flagged outputs are queued for human review and, on approval, promoted to the golden set.
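The flagging step can be sketched as a windowed outlier rule: a score that falls far below the rolling baseline mean for its prompt class is queued for review. Quelm's actual change-point algorithm is not specified in this paper, so the z-score rule below is an illustrative stand-in, not the production detector.

```typescript
// Illustrative stand-in for Layer 2's statistical flagging: an output
// whose quality score sits more than k standard deviations below the
// rolling baseline mean is flagged for human review.
function flagOutlier(history: number[], score: number, k = 3): boolean {
  if (history.length < 2) return false; // not enough baseline yet
  const mean = history.reduce((s, x) => s + x, 0) / history.length;
  const variance =
    history.reduce((s, x) => s + (x - mean) ** 2, 0) / (history.length - 1);
  const std = Math.sqrt(variance);
  if (std === 0) return score !== mean;
  return (mean - score) / std > k; // only degradations are flagged
}
```

A one-sided rule is deliberate here: scores above the baseline are candidate golden outputs, not failures.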
5.2 Cross-provider drift intelligence
Because Quelm aggregates anonymised quality signals across its customer base, it can detect provider-side behavioural changes in aggregate before any individual deployment's quality metrics cross alert thresholds. Quelm issues a fleet-wide advisory within hours of a detectable change — before the first individual customer's support tickets arrive.
5.3 Integration
Layer 2 integration requires two lines of code:
```typescript
import { quelm } from '@quelm/sdk'

const client = quelm.wrap(new Anthropic({ apiKey: process.env.ANTHROPIC_KEY }))
```

The wrapped client behaves identically to the original at the API boundary. No changes to calling code are required. PII scrubbing operates on the local agent before any metadata is synchronised externally; raw prompts and completions never leave the customer's infrastructure.
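To illustrate the integration model — forwarding calls untouched while recording metadata off the request path — a wrapper of this kind can be sketched with a JavaScript Proxy. This is an assumption about how such a wrapper could work, not the SDK's actual implementation.

```typescript
// Hypothetical sketch of a BYOK wrapper: a Proxy forwards method calls
// unchanged and records timing metadata asynchronously, so no synchronous
// overhead is added to the production request path. Not the actual SDK.

type Recorder = (meta: { method: string; latencyMs: number }) => void;

function wrap<T extends object>(client: T, record: Recorder): T {
  return new Proxy(client, {
    get(target, prop, receiver) {
      const value = Reflect.get(target, prop, receiver);
      if (typeof value !== "function") return value;
      return async (...args: unknown[]) => {
        const start = Date.now();
        const result = await value.apply(target, args);
        // Quality evaluation and metadata logging happen off the
        // request path, after the response has been returned.
        queueMicrotask(() =>
          record({ method: String(prop), latencyMs: Date.now() - start }),
        );
        return result;
      };
    },
  });
}
```

Because the Proxy returns the provider's own responses unmodified, calling code cannot observe the interception.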
6. Layer 3 — Output certification
6.1 The verification problem
Layers 1 and 2 both rely on comparison — either against a curated baseline or against historical production norms. This approach has a structural blind spot: novel failures. An output that is internally inconsistent in a way that has never been observed before will not trigger a comparison-based alert.
A separate body of research has explored whether LLMs can be instructed to verify their own outputs. The findings are unambiguous: they cannot. Huang et al. (2023, ICLR 2024) demonstrated that intrinsic self-correction fails across GPT-3.5, GPT-4, and Llama-2. Stechly, Valmeekam, and Kambhampati (2024, ICLR 2025) showed that GPT-4 fails at both generating and verifying solutions on formal tasks. The March 2025 arXiv paper "Consensus is Not Verification" proved that no aggregation method using only internal signals consistently outperforms single-sample baselines.
6.2 The integrity-embedded verification model
Quelm's output certification mechanism is grounded in a verification paradigm distinct from comparison-based approaches: the principle of embedded integrity signals.
In well-designed structured data systems, validity is not asserted by the originating party and trusted by the receiver — it is independently computable from the data itself. A receiver who can derive the same integrity signal from the raw fields, and compare it against the declared signal, requires no prior knowledge of what a correct output looks like. The validity proof is intrinsic to the data, not extrinsic to it.
This paradigm is realised in checksum-based data integrity systems, where a function over a structured field sequence produces a value that any receiver can independently recompute. Corruption — whether introduced in transmission or at the source — is detectable without reference to the original uncorrupted record. Three properties make this verification model formally sound:
- The verification function is deterministic. Given the same input fields, the function always produces the same integrity value. There is no probabilistic element in the verification step.
- The verifier is computationally independent. The integrity signal is derived by a process separate from and unaware of the originating system's internal state. The verifier does not ask the source whether the data is valid — it computes validity from the data itself.
- The scope is bounded. The verification function operates over a defined, structured field set. It does not attempt to assess semantic truth against external reality; it asserts only that the declared relationships hold within the data as presented.
The third property is as important as the first two. Checksum-based verification does not guarantee that the underlying data is factually correct — it guarantees only that the data is internally self-consistent with respect to a declared set of relationships. This is a weaker property than ground-truth correctness, but it is a stronger property than probabilistic plausibility, and it is achievable deterministically at generation time without external reference.
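As a concrete instance of this verification model, consider a mod-97 check-digit scheme of the kind used in structured payment references and IBANs: the sender appends check digits computed from the payload, and any receiver can recompute them with no prior knowledge of what a correct payload looks like. The sketch below is illustrative, not a specific national payment standard.

```typescript
// Mod-97 check digits as an example of an embedded integrity signal:
// deterministic, independently recomputable, and bounded in scope.

function mod97(digits: string): number {
  // Iterative remainder keeps arbitrarily long digit strings exact.
  let rem = 0;
  for (const d of digits) rem = (rem * 10 + Number(d)) % 97;
  return rem;
}

// Sender side: derive the integrity signal from the payload itself.
function withCheckDigits(payload: string): string {
  return payload + String(mod97(payload)).padStart(2, "0");
}

// Receiver side: recompute and compare. No baseline, no trust in the
// originating party, no probabilistic element.
function verify(reference: string): boolean {
  const payload = reference.slice(0, -2);
  const declared = reference.slice(-2);
  return mod97(payload) === Number(declared);
}
```

Note the bounded scope in action: `verify` says nothing about whether the payload refers to a real transaction; it asserts only that the declared relationship between payload and check digits holds.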
Quelm applies this principle to structured LLM outputs. The model is required to declare the logical relationships it asserts to hold between output fields — an embedded integrity signal. Quelm's certification engine independently recomputes those relationships from the raw fields. The engine is not a model call; it is a deterministic function. It cannot hallucinate. The only question it answers is whether the declared relationships hold over the actual values produced.
6.3 Quelm's certification mechanism
When Layer 3 is active, Quelm augments the system prompt with a certification instruction requiring the model to produce, alongside the primary output, a structured declaration of the logical relationships it asserts to hold between output fields. This declaration is the certificate.
Three certification levels are defined:
Level 1 — Logical consistency
The model declares that internal values are consistent: that line items sum to stated totals, that dates are in correct sequence, that referenced entities are named consistently throughout the output. Quelm recomputes each claim mechanically from the output fields.
Level 2 — Cross-field dependency validation
The model encodes the dependency graph between output fields as a set of verifiable rules. A contract summary in which total_value does not equal duration_months times monthly_fee fails immediately, regardless of how confidently the model stated the total.
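A Level 1 or Level 2 check reduces to evaluating declared relationships against the raw output fields. The sketch below uses the contract example from the text; the `Rule` format and `certify` function are illustrative names, not Quelm's actual certificate schema.

```typescript
// Illustrative certification check: declared cross-field relationships
// are recomputed deterministically from the raw output fields.

interface ContractSummary {
  monthly_fee: number;
  duration_months: number;
  total_value: number;
}

type Rule<T> = { name: string; holds: (o: T) => boolean };

const contractRules: Rule<ContractSummary>[] = [
  {
    name: "total_value = duration_months * monthly_fee",
    holds: (o) =>
      Math.abs(o.total_value - o.duration_months * o.monthly_fee) < 0.005,
  },
  { name: "duration_months is positive", holds: (o) => o.duration_months > 0 },
];

// The certification engine is a deterministic function, not a model call.
// It returns the name of every declared relationship that fails to hold.
function certify<T>(output: T, rules: Rule<T>[]): string[] {
  return rules.filter((r) => !r.holds(output)).map((r) => r.name);
}
```

An empty result means the output is internally consistent with respect to the declared rules; a non-empty result names exactly which relationship failed, which is what makes a failed certification articulable.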
Level 3 — Deterministic fingerprint
A canonical checksum is derived from the output's key fields in a defined, schema-specific order. The model is instructed to compute and declare this fingerprint. Quelm recomputes it independently. Because the model cannot predict precisely which fields Quelm will include in the canonical ordering, consistent hallucination across both the output content and the fingerprint is statistically improbable.
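The recomputation side of Level 3 can be sketched as follows, assuming a schema-specific canonical field order and SHA-256 as the hash; both are assumptions for illustration, as the paper does not specify Quelm's canonicalisation or hash choice.

```typescript
import { createHash } from "node:crypto";

// Illustrative Level 3 recomputation: fields are serialised in a fixed,
// schema-specific order and hashed. Canonical order and SHA-256 are
// assumptions made for this sketch, not Quelm's specification.

function fingerprint(
  output: Record<string, unknown>,
  canonicalOrder: string[],
): string {
  const canonical = canonicalOrder
    .map((field) => `${field}=${JSON.stringify(output[field])}`)
    .join("|");
  return createHash("sha256").update(canonical).digest("hex").slice(0, 16);
}

// Certification passes only if the declared fingerprint matches the
// engine's independent recomputation over the actual field values.
function checkFingerprint(
  output: Record<string, unknown>,
  declared: string,
  canonicalOrder: string[],
): boolean {
  return fingerprint(output, canonicalOrder) === declared;
}
```

Because the fingerprint is a function of every field in the canonical ordering, altering any single covered value invalidates the declared fingerprint.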
6.4 What certification does and does not guarantee
Certification Level 1 and 2 guarantee internal consistency — that the output does not contradict itself. They do not guarantee factual correctness. An invoice that fabricates a product at a plausible price, with line items that sum correctly to a plausible total, will pass Level 1 and Level 2 certification. The fabrication is not internally detectable.
This is a deliberate scope boundary, not a design flaw. Internal consistency is a necessary condition for a correct output. An output that fails certification is provably wrong in a specific, articulable way. An output that passes certification may still be factually wrong, but it is not self-contradictory.
6.5 Distinction from prior self-verification literature
The critical structural distinction between Quelm's certification mechanism and the LLM self-verification approaches documented in the academic literature is the locus of verification. In self-verification approaches, the same model that generated the output is asked to evaluate it. Quelm's recomputation step is not a further model call. It is a deterministic computation performed by Quelm's certification engine on the structured fields of the output. The engine cannot hallucinate. The model cannot predict exactly what the engine will check.
7. Competitive landscape
The LLM observability and reliability market divides into four functional clusters, none of which individually addresses the full problem set that Quelm targets.
7.1 Observability platforms
Langfuse, Helicone, Arize Phoenix, and Weights & Biases Weave provide request logging, latency tracking, cost attribution, and trace visualisation. These tools answer the question: what happened? They do not provide scheduled regression testing, output quality scoring, or any form of output certification.
7.2 Evaluation frameworks
Braintrust, DeepEval (Confident AI), LangSmith, and Patronus AI provide LLM evaluation capabilities. Braintrust is the closest existing product to Quelm's Layer 1. No evaluation framework offers automated test synthesis from production traffic, cross-provider drift intelligence, or output certification.
7.3 Gateway and proxy products
LiteLLM, Portkey, Cloudflare AI Gateway, and Kong AI Gateway provide routing, failover, semantic caching, and cost management at the API gateway layer. These products address infrastructure concerns, not output quality. Quelm deliberately does not compete in this space.
7.4 Structured output validators
Outlines, Instructor, and Guardrails AI enforce structural conformance — JSON schema validation, Pydantic type checking, automatic retry on format violations. These tools guarantee that an output has the right shape. They do not verify that the values within that shape are correct, consistent, or stable over time.
7.5 Quelm's positioning
Quelm occupies the intersection of two gaps: the gap between observability (what happened) and assurance (was it correct), and the gap between structural validation (right shape) and semantic validation (right content). No existing product occupies this intersection.
8. Target applications
Quelm's reliability stack is applicable wherever LLM outputs drive consequential downstream actions.
8.1 Financial document processing
Invoice extraction, financial statement summarisation, and contract data extraction are among the highest-stakes LLM applications in production today. Layer 3 certification catches internal arithmetic errors and cross-field inconsistencies at the moment of generation.
8.2 Legal and compliance workflows
Contract review, regulatory compliance checking, and legal document summarisation require outputs that are internally consistent and accurately cross-referenced. Quelm's certification layer provides a deterministic check on internal cross-reference consistency.
8.3 Clinical and healthcare data processing
Clinical note extraction, medication reconciliation, and diagnostic coding are LLM applications where output errors have direct patient safety implications. Quelm's HIPAA-compatible architecture — no data egress, customer-controlled deployment — satisfies healthcare enterprise requirements.
8.4 Customer-facing automation
Support bots, sales automation, and customer communication workflows require behavioural stability. Layers 1 and 2 provide the scheduled regression testing and continuous monitoring required to detect silent behavioural drift.
9. Research foundations and prior art
The following papers are foundational to the design of Quelm's output certification mechanism.
Huang et al. (2023) — Large Language Models Cannot Self-Correct Reasoning Yet
ICLR 2024 (Google DeepMind)
openreview.net/forum?id=IkmD3fKBPQ
Demonstrates that intrinsic self-correction fails across GPT-3.5, GPT-4, and Llama-2. Primary basis for Quelm's decision to use external recomputation.
Stechly, Valmeekam & Kambhampati (2024) — On the Self-Verification Limitations of Large Language Models
ICLR 2025
arxiv.org/abs/2402.08115
Establishes that GPT-4 fails at both generating and verifying solutions on formal tasks. Self-verification does not improve with model scale.
Anonymous (2025) — Consensus is Not Verification
arXiv, March 2025
arxiv.org/html/2603.06612
Proves that no aggregation method using only internal signals consistently outperforms single-sample baselines. Models agree with each other more reliably than they agree with truth.
Kambhampati et al. (2024) — LLMs Can't Plan, But Can Help Planning in LLM-Modulo Frameworks
ICML 2024 Spotlight
proceedings.mlr.press/v235/kambhampati24a.html
Proposes the LLM-Modulo framework: LLMs generate candidates, external model-based verifiers check them. Quelm follows this template.
Chen, Zaharia, and Zou (2023) — How is ChatGPT's Behavior Changing over Time?
arXiv
arxiv.org/abs/2307.09009
The empirical foundation for the provider-side drift problem. Documents measurable performance degradation across GPT-4 and GPT-3.5 with no version change.
10. Conclusion
The production reliability gap in LLM-powered systems is a structural problem, not an operational one. It cannot be closed by more careful prompt engineering, more thorough pre-deployment testing, or closer monitoring of observability dashboards. It requires a dedicated reliability layer that operates continuously across the full output lifecycle.
Quelm addresses this gap through three complementary mechanisms, each targeting a failure mode that the others cannot reach. Layer 1 provides the scheduled regression safety net. Layer 2 provides the production coverage that compounds with usage. Layer 3 provides the one capability that the existing research landscape has not previously offered in a production-ready form: an externally computed, deterministic integrity signal that is inseparable from the output itself.
The correlated hallucination problem — the reason why self-verification approaches fail — is not a temporary limitation of current model capabilities. It is a structural consequence of the way large language models generate text. Quelm's certification layer is designed around this constraint from first principles, drawing on the verification model that structured financial payment systems have used for decades: embed the proof in the data, compute it externally, reject any mismatch.
The result is a reliability infrastructure that is, for the first time, commensurate with the stakes of the systems it serves.