Back

About Quelm

Reliability intelligence
for production AI.

We built Quelm because the tooling layer for LLM-powered products has a gap that no observability dashboard, format validator, or gateway product closes: the gap between what happened and whether it was correct.


The problem we're solving

Building on LLM APIs is fundamentally different from building on any other API. A REST endpoint either works or it doesn't. An LLM endpoint returns something plausible-looking every time — even when the behaviour has silently changed, the output contradicts itself, or the model you're calling today is not the model you called last month.

In April 2025, a silent update to GPT-4o introduced extreme sycophantic behaviour within 48 hours — without a changelog entry, without a version bump, without any alert to the teams whose products broke. In August 2025, Anthropic infrastructure bugs degraded Claude's quality across 30% of production calls for several weeks before any official acknowledgement. Google silently redirected a dated model endpoint to a different build with no notice.

These are not edge cases. They are the normal operating conditions of LLM production systems in 2026. The teams building on these APIs deserve the same reliability infrastructure that every other part of the software stack takes for granted: regression testing, live monitoring, and real-time output validation. That infrastructure does not exist yet. Quelm is building it.


How the platform works

Quelm operates across three layers, each targeting a different failure mode at a different point in time. Layer 1 fires on a schedule before any user sees a change. Layer 2 fires asynchronously after every live response. Layer 3 fires inline at the exact instant of generation.

Quelm reliability stack

Three layers · one SDK · no proxy

Layer 1 — regression testing · proactive · scheduled

scheduled run · 04:00 UTC · last 7 runs

baseline · cosine similarity

  prompt_001   cos_sim 0.97
  prompt_002   cos_sim 0.71 !
  ...          ...
  prompt_003   cos_sim 0.37 !

2 regressions detected · alert dispatched

regression suite

  golden set: 47 prompts
  passed: 45 · failed: 2
  auto-promoted: 3 new from traffic

Layer 2 — live traffic monitoring · reactive · async

SDK integration

  npm install @quelm/sdk

  import { quelm } from '@quelm/sdk'

  const client = quelm.wrap(
    new Anthropic({ ... })
  )

live traffic → agent → observe / pass / flag

24h traffic summary

  12,847 total calls · 3 flagged · 99.98% clean

provider drift signal: fleet advisory issued · detected 5.4h before alerts
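The wrapping pattern above can be sketched generically. This is an assumption about how an SDK-side observer can work, not Quelm's actual `wrap` implementation: the wrapper forwards the call unchanged, then reports the result asynchronously, so analysis never adds latency to the live path and a failing reporter never breaks the caller.

```typescript
// Illustrative sketch of async call observation (the wrap semantics
// here are an assumption, not Quelm's published API).

type Verdict = 'observe' | 'pass' | 'flag';

function wrap<TArgs extends unknown[], TResult>(
  call: (...args: TArgs) => Promise<TResult>,
  report: (result: TResult) => Promise<Verdict>,
): (...args: TArgs) => Promise<TResult> {
  return async (...args: TArgs) => {
    const result = await call(...args);
    // Fire-and-forget: analysis runs after the response has already
    // been returned to the caller, adding no latency to the live path.
    void report(result).catch(() => { /* never break the caller */ });
    return result;
  };
}
```

The key design choice this illustrates: the monitoring layer sits beside the request path, not inside it, which is what makes "no proxy" possible.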

Layer 3 — output certification · inline · baseline-free · deterministic

output fields

  duration_months: 12
  monthly_fee:     2500
  total_value:     30000
  date_start:      2026-01-15
  date_end:        2026-04-15

declared relationships

  12 × 2500 = 30000
  date_start < date_end
  span = 3 months ≠ 12 declared

certification engine

  recomputing declared relationships...
  arithmetic: 12 × 2500 = 30000 — pass
  date span ≠ declared months — violation
  fingerprint: sha256 9f3a...c7e1 — match

certification failed · 1 violation
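The worked example above can be recomputed in plain code, which is the point of the layer: no model call is needed to catch the inconsistency. This sketch mirrors the field names from the example, but the rule set is illustrative, not Quelm's actual certification engine.

```typescript
// Deterministic recomputation of the declared relationships from the
// example above (rule set is illustrative, not Quelm's engine).

interface ContractOutput {
  duration_months: number;
  monthly_fee: number;
  total_value: number;
  date_start: string; // ISO date, e.g. '2026-01-15'
  date_end: string;   // ISO date
}

// Whole-month span between two ISO dates (calendar months, UTC).
function monthSpan(startIso: string, endIso: string): number {
  const s = new Date(startIso);
  const e = new Date(endIso);
  return (e.getUTCFullYear() - s.getUTCFullYear()) * 12 +
         (e.getUTCMonth() - s.getUTCMonth());
}

// Recompute every declared relationship; return the violations found.
function certify(out: ContractOutput): string[] {
  const violations: string[] = [];
  if (out.duration_months * out.monthly_fee !== out.total_value) {
    violations.push('arithmetic: duration_months × monthly_fee ≠ total_value');
  }
  if (new Date(out.date_start) >= new Date(out.date_end)) {
    violations.push('dates: date_start must precede date_end');
  }
  if (monthSpan(out.date_start, out.date_end) !== out.duration_months) {
    violations.push('dates: span ≠ declared duration_months');
  }
  return violations;
}
```

Run against the example output (12 months, €2,500/month, €30,000 total, 2026-01-15 to 2026-04-15), the arithmetic and date-ordering checks pass but the three-month span contradicts the declared twelve months: exactly one violation, so certification fails.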

Our approach

SDK, not proxy

Quelm runs as a lightweight agent inside your infrastructure. No traffic routes through our servers. Your API keys stay with you. GDPR, HIPAA, and SOC 2 compliance is met by architecture, not policy.

Cross-provider

Native provider tooling cannot instrument its own silent updates. Quelm aggregates anonymised signals across customers and providers simultaneously — detecting fleet-wide drift hours before individual teams notice.
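One way to picture the fleet-wide signal is as a simple aggregation over anonymised per-tenant flag rates. The thresholding logic below is an assumption for illustration only; Quelm's actual detection is not described in this document.

```typescript
// Hedged sketch of fleet-level drift detection (aggregation rule is
// an assumption for illustration, not Quelm's actual method): pool
// anonymised per-tenant flag rates for one provider and raise an
// advisory when the fleet-wide rate jumps well above its baseline.

function fleetAdvisory(
  tenantFlagRates: number[], // today's flag rate per anonymised tenant
  baselineRate: number,      // historical fleet-wide flag rate
  multiplier = 3,            // how far above baseline counts as drift
): boolean {
  if (tenantFlagRates.length === 0) return false;
  const fleetRate =
    tenantFlagRates.reduce((a, b) => a + b, 0) / tenantFlagRates.length;
  return fleetRate > baselineRate * multiplier;
}
```

The intuition: a single tenant's spike is noise, but a simultaneous rise across many independent tenants on the same provider is evidence of a provider-side change, which is why the fleet can see drift before any individual team does.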

External verification

Layer 3 certification is not a further model call. It is a deterministic computation. The same system that generates errors cannot generate the certificate that catches them — that is the architectural guarantee.


Who's behind Quelm

GL

Georges Lieben

Co-founder

Georges has spent his career building companies at the intersection of technology, energy, and automation. He co-founded June Energy, a smart energy platform that automatically switches 20,000+ Belgian households to the best available tariff — an early exercise in deploying reliable, autonomous decision-making at scale in a regulated, high-stakes environment.

He has followed the evolution of generative AI from the beginning and has been integrating LLMs into product workflows since the early API era. His writing focuses on what he calls the shift from AI-as-conversation to AI-as-execution: the move from chatbots to autonomous operational layers. That shift is precisely what makes LLM reliability a first-order infrastructure problem — and what convinced him to build Quelm.

His companies today employ over 100 people. He is based between Antwerp and Porto.

TS

Tiemen Schotsaert

Co-founder

Tiemen is Operations Director Property BeLux at CED, one of Europe's leading independent claims management organisations. In that role he oversees large-scale property claims processing across Belgium and Luxembourg — managing expert networks, insurer relationships, and the operational workflows that turn damage reports into settled claims.

Claims processing is an ideal stress test for LLM reliability: outputs are structured, errors have direct financial and legal consequences, and silent failures — a misread date, a misattributed amount, an internally inconsistent summary — can propagate through insurer systems for weeks before anyone notices. Tiemen brings the domain perspective that shapes Quelm's design from the use case inward rather than from the technology outward.

He holds a degree from KU Leuven and is based in the Ghent metropolitan area.

Get in touch · hello@quelm.ai · quelm.ai

In Review · March 2026