Building Botline Anchor: a five-layer middleware that keeps WhatsApp AI grounded in your real catalog
A 12% bug rate on a 25-reply audit: three broken replies, any one of which would have made the news. We built the engine that catches them. Here is the architecture.
The problem: three bugs in one week of replies
In May 2026 we ran a quality audit on a Pakistani electronics retailer's last seven days of WhatsApp replies on Botline. Twenty-five replies sampled, three of them broken. Roughly a 12% bug rate — not catastrophic on its own, but every one of the three bugs had the same shape, and every one of them was the kind of bug that would cost a real customer real money or real trust.
Bug 1, 2026-05-02 — out-of-stock offer. The bot quoted Rs. 28,999 for an “Anker Power Bank 25K mAh 165W” and offered to take an order. The same product was correctly flagged as out of stock in another conversation the same day. The model had no consistent grounding on what was sellable.
Bug 2, 2026-05-07 — price hallucination. The bot quoted Rs. 1,04,999 for a Samsung Galaxy Watch Ultra 2025. Actual catalog price: Rs. 98,999. A Rs. 6,000 overshoot, in the bot's own confident voice, no hedge. Whoever placed an order at that quote would either have felt cheated when the real invoice landed, or the retailer would have had to swallow the Rs. 6,000 as a goodwill discount to keep the customer's trust.
Bug 3, 2026-05-05 — triple failure. A customer asked about the UGREEN Nexode 25k mAh. The bot replied in pure Malay (tersedia, "available"; dengan harga, "at a price of"; Ia mempunyai, "it has"), with the wrong capacity (20k instead of 25k) and the wrong price (Rs. 27,999 instead of Rs. 21,999). Three failures in one reply: language, capacity, price. Sent to a Pakistani customer who reads no Malay.
This is the same family of failure that bit our own platform-support outreach a few weeks earlier (the Hadeed Farm fabricated-number incident, which prompted the Phase 2 emergency cut on 2026-04-22). The pattern repeats: the model is fluent, the model is confident, and the model is wrong in a way no amount of prompt tuning has yet fixed. With hundreds of tenants and tens of thousands of replies per day projected by year-end, daily incidents are the floor, not the ceiling.
So we stopped trying to fix the model and built a layer around it.
Why the obvious fix — “just add a verifier prompt” — does not work
The first thing every team reaches for when an LLM hallucinates is to ask a second LLM to fact-check it. “Read this reply. Is anything wrong? If yes, fix it.” We tried that in Phase 1 (April 2026) and again in Phase 2 (the emergency cut). It works for some categories of bug, but it fails the most expensive ones.
The reason is structural. A verifier prompt asks the model to detect its own hallucinations, and a model that confidently invents a price is, by definition, the same model that confidently believes the invented price is correct. The verifier reads the original reply, sees no contradiction with anything in its training, and stamps it approved. We watched this happen to the Watch Ultra 2025 quote: the verifier had no independent ground truth to compare against, so it agreed with the original lie.
The fix is not a smarter verifier. The fix is to give the verifier independent ground truth. That ground truth already exists — it is sitting in woocommerce_products.price_cents and stock_status. The job of the middleware is to put that data into the responder context (so the model is less likely to lie in the first place), then deterministically extract every price and stock claim from the reply (so the verifier is code, not another guess).
Once you frame it that way, the design falls out of itself. You need a layer to inject the live data, a layer to verify the reply against the snapshot, a separate layer for the things you cannot capture in a SQL row (vocabulary preferences, language constraints, cultural register), and a layer to keep score so you know when a rule needs revising. That is the five-layer architecture.
Architecture: a package with two adapters
Before any of the layers existed, we made a decision about packaging that turned out to matter more than any individual algorithm. The whole grounding-and-verification engine lives in a single workspace package, @botline/anchor, with two adapters: an embedded adapter that runs in-process inside the existing message service, and an HTTP adapter that exposes the same operations as REST endpoints behind an authenticated API.
Day one we ship only the embedded adapter. The HTTP adapter is scaffolded, fully tested, and dormant — not deployed, no docker-compose entry, no SSL cert. But the architecture is already two-adapter so that nothing about Day 1's shape blocks Day N's commercial path.
- Embedded — import { anchor } from '@botline/anchor/embedded'. Runs in-process, talks directly to Drizzle and to the existing chat() helper for the regenerate loop. Zero network hops. This is what every Botline tenant's reply flows through.
- HTTP — packages/anchor/src/adapters/http.ts. A Fastify server that exposes POST /verify, POST /score, and the rule-registry CRUD endpoints. Authenticated via X-Anchor-Key. Built and tested but not deployed. When we expose Anchor as a standalone API for other WhatsApp-AI vendors (Path 2 in the commercial roadmap), the deploy is a docker-compose entry and an ECR build, not a refactor.
This is the same pattern that worked well for the email package and the booking package: write the core as pure logic with zero I/O, write thin clients for each delivery mode, ship one mode now and unblock the others later. The cost is one extra TypeScript file (adapters/http.ts) on Day 1; the savings are six months of refactor pain on Day N.
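As a concrete sketch of that shape (every name beyond @botline/anchor, POST /verify, and X-Anchor-Key is illustrative, and the real package is richer than this): the core is pure functions with no I/O, and each adapter is a thin wrapper that handles delivery.

import Fastify from 'fastify';

// Core: pure logic, zero I/O. Types are simplified for the sketch.
export interface ProductFact { sku: string; name: string; pricePkr: number; stockStatus: string; }
export interface Violation { kind: 'price' | 'stock'; detail: string; }

export function verifyReply(reply: string, snapshot: ProductFact[]): Violation[] {
  const violations: Violation[] = [];
  // ...price and stock checks against the snapshot (see the Layer 4 sketch below)...
  return violations;
}

// Embedded adapter: import and call in-process, no network hop.
export const anchor = { verify: verifyReply };

// HTTP adapter: the same operation behind Fastify, gated by the X-Anchor-Key header.
// ANCHOR_API_KEY and the port are hypothetical values for this sketch.
export async function startHttpAdapter() {
  const app = Fastify();
  app.post('/verify', async (req, res) => {
    if (req.headers['x-anchor-key'] !== process.env.ANCHOR_API_KEY) return res.code(401).send();
    const body = req.body as { reply: string; snapshot: ProductFact[] };
    return { violations: verifyReply(body.reply, body.snapshot) };
  });
  await app.listen({ port: 8787 });
}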
The five layers, one at a time
Each layer is a single TypeScript file in packages/anchor/src/core/. Each is independently testable. Each can be disabled independently via an env flag. Each fails open.
Layer 1 — The Codex. A versioned rule registry. Two tables (anchor_rules and anchor_rule_versions) plus a partial unique index that enforces exactly one active version at a time. Each rule has a kind enum (vocab_prefer, vocab_ban, register_match, market_fact, tenant_override, grounding_constraint, language_constraint) and a severity enum (block / warn / audit). The runtime reads the active snapshot, never individual rule rows, so a rule edit is atomic across all running responder calls.
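A minimal Drizzle sketch of the versions table and that partial unique index, assuming a recent Drizzle with partial-index support and treating the active version as global; column names are illustrative, only the table name and the one-active-version invariant come from the design above.

import { sql } from 'drizzle-orm';
import { pgTable, uuid, boolean, jsonb, timestamp, uniqueIndex } from 'drizzle-orm/pg-core';

export const anchorRuleVersions = pgTable('anchor_rule_versions', {
  id: uuid('id').primaryKey().defaultRandom(),
  rulesSnapshot: jsonb('rules_snapshot').notNull(),        // frozen copy of all enabled rules
  isActive: boolean('is_active').notNull().default(false),
  createdAt: timestamp('created_at').notNull().defaultNow(),
}, (t) => ({
  // Partial unique index: at most one row may have is_active = true,
  // so publish and rollback are atomic flips, never two live versions.
  onlyOneActive: uniqueIndex('anchor_rule_versions_one_active')
    .on(t.isActive)
    .where(sql`${t.isActive} = true`),
}));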
Layer 2 — Prompt Composer. Builds the responder system prompt fresh on every reply. Inputs: tenant, archetype, conversation history, last classification output. Output: a string that merges the founder voice, the top 30 vocab prefer/ban pairs from the Codex for this tenant's country, the cultural-register hints based on whether the conversation is a complaint vs a browse vs a confirm, and any tenant-specific overrides. Replaces the hand-rolled string concat that lived in the legacy responder for the last six months.
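A sketch of the composer's shape; the names, the register wording, and the input type are all illustrative. The point is that the prompt is assembled from the Codex snapshot on every call rather than hard-coded.

interface ComposeInput {
  founderVoice: string;                          // tenant's brand-voice block
  vocabPairs: { prefer: string; ban: string }[]; // top prefer/ban pairs from the Codex
  register: 'complaint' | 'browse' | 'confirm';  // from the last classification output
  tenantOverrides: string[];
}

export function composeSystemPrompt(input: ComposeInput): string {
  const vocab = input.vocabPairs
    .slice(0, 30)
    .map((p) => `Say "${p.prefer}", never "${p.ban}".`)
    .join('\n');
  const registerHint =
    input.register === 'complaint'
      ? 'Customer is upset: apologise first, keep the reply short, offer a concrete next step.'
      : input.register === 'confirm'
        ? 'Customer is ready to buy: confirm price, stock, and delivery in one message.'
        : 'Customer is browsing: suggest at most two products, ask one clarifying question.';
  return [input.founderVoice, vocab, registerHint, ...input.tenantOverrides].join('\n\n');
}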
Layer 3 — Live Data Injector. The most important layer. At responder context-build time, it pulls the last 10 messages, runs a regex + entity classifier to extract product mentions, queries WooCommerce / Shopify with a multi-strategy lookup (sku exact > token-overlap > substring > embedding fallback), and writes a structured live_inventory block into the responder context. The block looks like this:
<live_inventory>
<product sku="WCH-ULT-2025">
<name>Samsung Galaxy Watch Ultra 2025</name>
<price_pkr>98999</price_pkr>
<stock_status>instock</stock_status>
<stock_quantity>14</stock_quantity>
</product>
</live_inventory>
And the prompt grows a hard constraint: only quote prices, stock states, and variants from this block. If the product the user asked about is not here, say you'll check. This is what would have prevented the Watch Ultra 2025 hallucination in the first place. Layer 4 catches what slips through; Layer 3 prevents.
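Stepping back to the lookup that fills the block: the cascade is simple, and the precedence (sku exact, then token overlap, then substring, then embedding) is the one described above. The helper functions and thresholds here are hypothetical, a sketch of the shape rather than the real implementation.

interface Product { sku: string; name: string; pricePkr: number; stockStatus: 'instock' | 'outofstock'; }

// Hypothetical lookup helpers; each wraps a WooCommerce/Shopify query.
declare function findBySku(q: string): Promise<Product | null>;
declare function findByTokenOverlap(q: string, opts: { minScore: number }): Promise<Product | null>;
declare function findBySubstring(q: string): Promise<Product | null>;
declare function findByEmbedding(q: string, opts: { minSimilarity: number }): Promise<Product | null>;

export async function resolveProduct(mention: string): Promise<Product | null> {
  // 1. Exact SKU match: cheapest and unambiguous.
  const bySku = await findBySku(mention);
  if (bySku) return bySku;

  // 2. Token overlap: "ugreen nexode 25k" vs catalog names, best overlap score wins.
  const byTokens = await findByTokenOverlap(mention, { minScore: 0.6 });
  if (byTokens) return byTokens;

  // 3. Substring: catches truncated or partially typed names.
  const bySubstring = await findBySubstring(mention);
  if (bySubstring) return bySubstring;

  // 4. Embedding fallback: semantic nearest neighbour, gated by a similarity floor
  //    so a vague mention returns null instead of a confidently wrong product.
  return findByEmbedding(mention, { minSimilarity: 0.8 });
}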
Layer 4 — Output Verifier. Runs after the LLM returns. Regex-extracts every Rs. [\d,]+ token in the reply along with the preceding 3–8 words (the product name). For each, it looks up the price in the productsSnapshot that Layer 3 wrote. If the quoted price differs from the snapshot price by more than 5% (the tolerance is for typo robustness, not real divergence), the violation is recorded. Same for stock-availability claims (“available hai”, “in stock”, “stock mein hai”; the Roman-Urdu phrases mean “is available” / “is in stock”), which get checked against snapshot stock_status. If any block-severity violation exists, the reply is regenerated by calling chat() a second time with the original prompt plus the violation injected as critique. Cap of two regenerates; on the third attempt the reply fails open (it ships unverified) and is flagged for human review.
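A minimal sketch of the price half of that check. The regex and the 5% tolerance come from the description above; the snapshot type and the product-name matching are deliberately naive stand-ins for the real logic.

interface SnapshotItem { name: string; pricePkr: number; stockStatus: string; }
interface PriceViolation { product: string; quoted: number; actual: number; }

export function checkQuotedPrices(reply: string, snapshot: SnapshotItem[]): PriceViolation[] {
  const violations: PriceViolation[] = [];
  // Capture up to 8 preceding words (the likely product name) plus the quoted rupee amount.
  const pattern = /((?:\S+\s+){0,8}?)Rs\.\s*([\d,]+)/g;
  for (const match of reply.matchAll(pattern)) {
    const context = match[1].toLowerCase();
    const quoted = Number(match[2].replace(/,/g, ''));
    // Naive name match: first snapshot item sharing a meaningful token with the context.
    const item = snapshot.find((p) =>
      p.name.toLowerCase().split(/\s+/).some((tok) => tok.length > 2 && context.includes(tok))
    );
    if (!item) continue; // no snapshot match for this mention in this simplified sketch
    // 5% tolerance absorbs formatting noise; anything beyond it is a real divergence.
    if (Math.abs(quoted - item.pricePkr) / item.pricePkr > 0.05) {
      violations.push({ product: item.name, quoted, actual: item.pricePkr });
    }
  }
  return violations;
}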
Layer 5 — Language Guard. A separate post-LLM check that classifies the dominant language of the reply. If it matches the Codex's banned-languages list for this tenant (e.g. Malay on a Pakistani tenant), the reply is regenerated. We cache the detection per reply hash so repeat checks of the same text do not re-spend the detection call.
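A sketch of the guard's shape, assuming a generic detectLanguage helper and an in-memory cache; the real classifier and cache store are not specified here.

import { createHash } from 'node:crypto';

declare function detectLanguage(text: string): Promise<string>; // returns an ISO 639-1 code, e.g. 'ms' for Malay

const detectionCache = new Map<string, string>();

export async function dominantLanguage(reply: string): Promise<string> {
  // Cache by reply hash so re-checking the same text never re-spends a detection call.
  const key = createHash('sha256').update(reply).digest('hex');
  const hit = detectionCache.get(key);
  if (hit) return hit;
  const lang = await detectLanguage(reply);
  detectionCache.set(key, lang);
  return lang;
}

export function violatesLanguageRules(lang: string, bannedLanguages: string[]): boolean {
  return bannedLanguages.includes(lang); // e.g. ['ms'] on a Pakistani tenant
}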
The scoring layer — Quality Telemetry. It sits after the five reply-path layers rather than inside them: once Layers 4 and 5 finish, the scorer computes four axis scores on a 0..1 scale (price match, stock match, vocab match, language match) and a weighted composite (price 35%, stock 30%, language 25%, vocab 10%). Every reply gets a quality-log row; replies under 0.85 also get queued for review. The data flywheel turns flagged replies into new rules, and new rules into fewer flagged replies.
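The composite itself is a straight weighted sum; a minimal sketch with the weights above (the types and function names are illustrative).

export interface AxisScores { price: number; stock: number; language: number; vocab: number; } // each in 0..1

// Weights from the telemetry design: price 35%, stock 30%, language 25%, vocab 10%.
const WEIGHTS = { price: 0.35, stock: 0.3, language: 0.25, vocab: 0.1 } as const;

export function compositeScore(s: AxisScores): number {
  return s.price * WEIGHTS.price + s.stock * WEIGHTS.stock + s.language * WEIGHTS.language + s.vocab * WEIGHTS.vocab;
}

export function needsReview(s: AxisScores): boolean {
  return compositeScore(s) < 0.85; // sub-0.85 replies get queued for human review
}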
Fail-open by design: why we are not afraid to ship this
Every middleware that sits between an LLM and a customer is, by definition, a new failure mode. If Anchor crashes, times out, or returns garbage, what does the customer see? The wrong answer to that question is “an error.” The right answer is “the reply they would have gotten without Anchor.”
The reliability invariant we wrote on the whiteboard before any code: with Anchor in the path, reply quality is bounded below by today's baseline (it never gets worse) and bounded above only by Anchor's own success rate. If every Anchor layer crashes simultaneously, customers get the same replies they get today. Anchor only adds upside.
The implementation is a single helper, withAnchor(), that wraps every layer call:
export async function withAnchor<T>(
  layer: AnchorLayer,
  ctx: AnchorContext,
  fn: () => Promise<T>,
  fallback: T
): Promise<T> {
  // Three stacked kill switches: global env flag, per-tenant toggle, per-layer flag.
  if (process.env.ANCHOR_KILL_SWITCH === 'true') return fallback;
  if (!ctx.tenant.anchorEnabled) return fallback;
  if (!isLayerEnabled(layer)) return fallback;
  try {
    // Hard 2-second budget per layer; a slow layer degrades to the fallback, not an error.
    return await withTimeout(2_000, fn());
  } catch (err) {
    // Fail open: log the failure to anchor_failures, return the caller-supplied fallback.
    await logAnchorFailure({ layer, ctx, err });
    return fallback;
  }
}
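Called from the reply path, the wrapper looks roughly like this; the layer id, the context object, and the buildLiveInventoryBlock helper are illustrative, not the real call site.

// Layer 3 wrapped in the helper: if the injector times out or throws, the responder
// simply runs without a live_inventory block, which is today's behaviour anyway.
const liveInventoryXml = await withAnchor(
  'live_data_injector',  // hypothetical layer id; assumes AnchorLayer is a string union
  anchorCtx,             // carries the tenant (with its anchor_enabled flag), message id, etc.
  () => buildLiveInventoryBlock(conversation, tenant),
  ''                     // fallback: no grounding block, never an error to the customer
);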
Three kill-switch levels stack into this helper:
- Master env flag (ANCHOR_KILL_SWITCH=true): bypasses every layer for every tenant in roughly 90 seconds (env edit + force-recreate of the message container).
- Per-feature env flags (ANCHOR_GROUNDING_ENABLED, ANCHOR_LANGUAGE_GUARD_ENABLED, etc.): disable a single misbehaving layer while the rest keep working.
- Per-tenant DB toggle (tenants.anchor_enabled): bypasses Anchor for a single tenant; takes effect on the next reply.
Every fail-open path writes a row to anchor_failures with the layer name, tenant id, message id, and the error. We tail this table from the same telemetry dashboard. If a layer starts failing open at any meaningful rate, we see it within an hour and either roll back the offending Codex version or flip the layer's env flag while we investigate.
The one thing we never do is block a customer reply on an Anchor failure. The bare LLM reply is what we ship today. Anchor is the upgrade. Failing back to the upgrade-less state is, by definition, not a regression.
The data flywheel: flagged replies become new Codex rules
The mechanical part of Anchor is interesting; the strategic part is the data flywheel. Every reply that lands in anchor_flagged_replies with a sub-0.85 composite score is a candidate Codex rule. The admin reviews the flagged reply at /admin/anchor/flagged, and at the bottom of the review form there is a “Propose new rule” button that pre-fills a rule editor with the most likely shape (vocab swap, banned phrase, grounding constraint, register match) inferred from the flag reason.
The admin can refine, click Save Rule, then click Publish Version. The new rule goes live in 60 seconds. From that moment on, every responder call retrieves a Codex snapshot that includes the new rule, and every Output Verifier pass uses it to grade. The next time a similar bug shape occurs, the verifier catches it before the customer sees it.
Versioning is what makes this safe. Every publish snapshots all enabled rules into a new anchor_rule_versions row and flips the is_active flag (the partial unique index ensures atomic switch-over). If the new rule turns out to over-trigger and break legitimate replies, the admin clicks Rollback on the previous version and the entire rule set reverts in the next codex-client poll cycle. No deploy. No code change. The rule registry is configuration, not code.
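Under the hood, publish is one transaction: deactivate whatever is active, insert the new snapshot as active. A sketch in Drizzle terms, assuming the anchorRuleVersions table from the earlier schema sketch and a configured db instance.

import { eq } from 'drizzle-orm';
// db and anchorRuleVersions are assumed: the configured Drizzle instance and the
// versions table from the schema sketch earlier in this post.

export async function publishCodexVersion(rulesSnapshot: unknown) {
  await db.transaction(async (tx) => {
    // Deactivate whatever is currently active...
    await tx.update(anchorRuleVersions)
      .set({ isActive: false })
      .where(eq(anchorRuleVersions.isActive, true));
    // ...then insert the new snapshot as the single active version. The partial
    // unique index rejects any state that would leave two rows active.
    await tx.insert(anchorRuleVersions).values({ rulesSnapshot, isActive: true });
  });
}

// Rollback is the same flip: clear is_active everywhere, then set it on the chosen older version.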
What this means for the moat is that the engine is replaceable but the rules are not. Today's leading AI model will be replaced by something better in a quarter, and by something nobody has heard of yet in two. None of those swaps invalidate a single Codex rule. Every market quirk we have encoded, every Pakistani vocab preference we have observed, every cultural register hint we have validated — all of it stays. The rules accumulate. By month 12 we will have hundreds; by month 24, thousands. No competitor walks in on day one with that.
What is next: B2B and beyond
Day 1, Anchor is bundled differentiation. Every Botline tenant gets it for free; the marketing line is “the only WhatsApp AI grounded in your real catalog.” No metered tier, no upcharge, no “quality add-on.” Per our pricing rule for Botline (included in existing tiers, infra-only fees passed through), there is no clean way to charge separately even if we wanted to.
The interesting commercial paths are downstream:
- Path 2 — standalone B2B (~9 months out). Once the Codex has accumulated enough rules and we have measurable quality numbers across hundreds of tenants, we expose the HTTP adapter as a public API at anchor.botline.cc. Anchor Free for 10K verifications/month with the default Codex and a brand watermark. Anchor Pro at $99/mo for 1M verifications, custom Codex, no watermark. Anchor Enterprise at $999+/mo for dedicated cluster, your-Codex-in-our-cloud, SLA. Wati and Dealism would be the obvious first customers; their hallucination rates are visibly worse than ours and they have nothing equivalent in their pipeline. The code already supports this path — the HTTP adapter is scaffolded and tested, just not deployed.
- Path 3 — open-source SDK + paid Cloud (~18–24 months). Open-source the @botline/anchor package; charge for the hosted Codex, the hosted telemetry, and industry-specific rule packs as paid subscriptions. The SDK code is the giveaway; the rule registry is the moat. Same model as HashiCorp's open-source-core / paid-cloud, applied to LLM grounding.
The lesson, which we keep relearning across the platform: most LLM quality problems are not about better prompts or better models. They are about independent ground truth, structured rule registries, and fail-open architecture. The model gets a single shot at a hard problem; the middleware gets infinite shots at the easy parts surrounding it. That is where reliable AI lives.
If you want to see Anchor in action, sign up at botline.cc/signup and connect a WooCommerce or Shopify store. Every reply your AI sends from that point on is going through all five layers. The marketing page lives at /features/anchor if you want the executive summary instead of this engineering deep-dive.