Automating customer success on WhatsApp: how we built an hourly health scanner for our own tenants
A meat-shop owner sat frustrated for two days. We didn't notice. So we built the engine that would have. Here's the architecture.
The problem: silent breakage of customer accounts
In May 2026 a Pakistani butcher signed up to Botline, briefly went live, then his WhatsApp link broke. He didn't complain. He didn't email. He just… stopped using the app. We only spotted it on day three because the founder happened to query the database during an unrelated bug investigation.
This is the most ordinary failure mode in SaaS. It's not about the angry customers writing tickets — you already know about those. It's about the silent ones, where the account drifts from broken to cold to gone, and your only signal is a missing entry in your weekly retention dashboard six weeks later.
For a Botline tenant, “broken” can mean several things, and most of them aren't their fault: a WhatsApp pairing dropped silently, a misconfigured AI provider/model combo causes every reply to fail, an onboarding step got skipped, or the customer simply went quiet after a busy month and never came back. Each of these is detectable with a single SQL query against state we already have. The challenge isn't detection — it's acting on detection at human speed in a way that doesn't feel automated.
Why WhatsApp, not email, for re-engagement
Customer-success tooling defaults to email because that's where customer-success tooling grew up. We use email too — for the founder digest, for billing, for non-urgent ops. But for an account that's broken right now, email is the wrong channel for three reasons.
- Open rate. WhatsApp open rate is north of 90% within an hour. Marketing email is 20% on a good day, half of that on the average day, and almost all of those opens happen in the inbox preview pane without ever clicking through. If we're trying to surface a fixable problem before it becomes a churn event, we need eyeballs on the message in minutes, not days.
- Reply rate. WhatsApp replies come back at roughly 5x the rate of email replies, in our own benchmarks of platform-support outreach. Most of those replies are short — “haan, theek hai” (“yes, all good”) or “trying now” — but a short reply is enough to convert a passive disconnect into an active conversation.
- Existing trust. Botline tenants already have a WhatsApp thread open with our founder — the platform-support thread that ships with every account (we wrote about that in the support migration write-up). Sending a re-engagement message into that thread feels like a friend texting, not a SaaS dunning email.
The only real downside is rate-limit risk. WhatsApp will rate-limit a number that sends too many cold-style messages. We address that with a strict per-tenant quota (max 1 outreach per signal type per 14 days; max 2 across all signal types per 7 days) enforced both at detection time and at send time.
Architecture: cron, detectors, draft generator, queue
The whole feature is four moving pieces, each independently testable and deployable (the hourly entry point is sketched after this list):
- Hourly cron in our existing message service (cron-tenant-health.ts). No new container.
- Four detectors — each a single SQL query, each wrapped in its own try/catch so a failure in one doesn't cascade. They write rows into tenant_health_signals with status='pending'.
- Draft generator — calls Bedrock Haiku 4.5 with the signal type, the evidence payload, and the tenant's preferred language. Writes the draft text back to the signal row.
- Multi-admin queue at /admin/customer-success — an SSE-realtime list of pending drafts with Approve / Edit / Dismiss / Snooze actions. The send path reuses our existing platform-support send pipeline (the same one that powers support@botline.cc).
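For concreteness, here's a minimal sketch of the hourly tick. The module paths, the `db` helper, and the detector/generator names are illustrative assumptions, not our exact code — but the two-pass shape (detect, then draft) is the real design:

```typescript
// cron-tenant-health.ts — sketch of the hourly tick. Module paths, the `db`
// helper, and detector names are illustrative assumptions.
import { db } from "./db";
import { detectWaDisconnected } from "./detectors/wa-disconnected";
import { detectNeverConnected } from "./detectors/never-connected";
import { detectProviderMismatch } from "./detectors/provider-mismatch";
import { detectSilentAfterActivity } from "./detectors/silent-after-activity";
import { generateDraft } from "./draft-generator";

const detectors = [
  detectWaDisconnected,
  detectNeverConnected,
  detectProviderMismatch,
  detectSilentAfterActivity,
];

export async function runTenantHealthTick(): Promise<void> {
  // Each detector is wrapped in its own try/catch so one failing query
  // can't block the other three.
  for (const detect of detectors) {
    try {
      await detect(db); // writes tenant_health_signals rows with status='pending'
    } catch (err) {
      console.error(`[tenant-health] detector ${detect.name} failed`, err);
    }
  }

  // Drafting is a separate pass: if the model call fails, the signal rows
  // are already persisted and the next tick picks them up.
  const pending = await db.query(
    `SELECT id FROM tenant_health_signals
     WHERE status = 'pending' AND draft_text IS NULL`
  );
  for (const row of pending.rows) {
    try {
      await generateDraft(row.id);
    } catch (err) {
      console.error(`[tenant-health] draft for signal ${row.id} failed`, err);
    }
  }
}
```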
The four day-1 detectors:
- wa_disconnected — a primary WhatsApp instance has been in disconnected status for more than 24 hours after previously being live. The query joins tenants, tenant_whatsapp_instances, and messages to skip never-connected tenants (they get their own signal type). Evidence written: instance name, phone number, hours disconnected, last tenant message timestamp.
- never_connected — tenant created >48h ago, has a WhatsApp instance row provisioned, but has never reached connected. Onboarding stuck.
- provider_mismatch — the tenant's ai_provider and ai_model columns form an invalid combination (e.g. a Bedrock model selected on a DeepSeek provider line). Validated against a static regex map (sketched after this list). We saw this hit three real tenants in the same week before we built the detector — a config bug that silently fails every AI reply.
- silent_after_activity — a tenant had ≥10 inbound messages in some prior week, then 0 inbound for the last 7 days. The classic engagement cliff. Could be vacation; could be churn. Worth a check-in.
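The validity map itself is tiny. A sketch with made-up patterns — the real entries live alongside our provider config, and in production the check can just as well be inlined into the detector's SQL with Postgres's regex operator:

```typescript
// Static provider → model-pattern map. Patterns are illustrative; only the
// two providers named in the example above are shown.
const VALID_MODEL_PATTERNS: Record<string, RegExp> = {
  bedrock: /^(anthropic|amazon|meta)\./, // Bedrock model IDs are vendor-prefixed
  deepseek: /^deepseek-/,
};

export function isValidProviderModel(provider: string, model: string): boolean {
  const pattern = VALID_MODEL_PATTERNS[provider];
  return pattern !== undefined && pattern.test(model);
}
```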
Each detector also runs a NOT EXISTS check against tenant_health_signals to skip tenants we've already flagged for the same signal type recently — the rate limiter at detection time. The matching check at approve time (covered below) is the safety net, in case settings changed between detection and send.
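Here's what the first detector might look like. The table names come from above; the column names (is_primary, disconnected_at, created_at) are guesses at our schema, and the real query differs in detail:

```typescript
// detectors/wa-disconnected.ts — a sketch, not the production query.
import type { Db } from "../db"; // assumed shared query-helper type

export async function detectWaDisconnected(db: Db): Promise<void> {
  await db.query(`
    INSERT INTO tenant_health_signals (tenant_id, signal_type, status, evidence)
    SELECT t.id,
           'wa_disconnected',
           'pending',
           jsonb_build_object(
             'instance_name',       w.instance_name,
             'phone_number',        w.phone_number,
             'hours_disconnected',  round(extract(epoch FROM now() - w.disconnected_at) / 3600),
             'last_tenant_message', (SELECT max(m.created_at) FROM messages m
                                     WHERE m.tenant_id = t.id AND m.role = 'tenant')
           )
    FROM tenants t
    JOIN tenant_whatsapp_instances w ON w.tenant_id = t.id AND w.is_primary
    WHERE w.status = 'disconnected'
      AND w.disconnected_at < now() - interval '24 hours'
      -- "previously live": at least one message has ever flowed, which also
      -- excludes never-connected tenants (they get their own signal type)
      AND EXISTS (SELECT 1 FROM messages m WHERE m.tenant_id = t.id)
      -- detection-time rate limit: skip tenants flagged for this signal recently
      AND NOT EXISTS (
        SELECT 1 FROM tenant_health_signals s
        WHERE s.tenant_id = t.id
          AND s.signal_type = 'wa_disconnected'
          AND s.created_at > now() - interval '14 days'
      )
  `);
}
```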
Multilingual via AI, not template explosion
Our tenant base spans Pakistan, Malaysia, Indonesia, the UAE, and India. That's six languages we routinely see in customer messages: Roman Urdu, English, Urdu, Arabic, Bahasa, and Hindi. Sometimes a tenant types in two of them in the same conversation.
The traditional customer-success approach is one template per language per signal: 4 signal types × 6 languages = 24 templates to author, translate, A/B test, and maintain in lockstep. Past three or four languages, this stops scaling and the company quietly defaults to English.
We don't do templates at all. Each draft is generated fresh by Bedrock Haiku 4.5 with a prompt that includes the signal type, the evidence payload (e.g. { duration_hours: 36, instance_name: "..." }), and the tenant's preferred language. The model writes the message in the founder's tone, addresses the specific issue, and gives 2–3 concrete next-step instructions.
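A minimal sketch of that call via the Bedrock Converse API. The prompt wording, the model ID string, and the Signal shape are assumptions, not our production prompt:

```typescript
// draft-generator.ts — minimal sketch of the draft call.
import {
  BedrockRuntimeClient,
  ConverseCommand,
} from "@aws-sdk/client-bedrock-runtime";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

interface Signal {
  signalType: string;                // e.g. 'wa_disconnected'
  evidence: Record<string, unknown>; // e.g. { duration_hours: 36, instance_name: "..." }
  preferredLanguage: string;         // e.g. 'roman_urdu'
}

export async function draftOutreach(signal: Signal): Promise<string> {
  const res = await client.send(
    new ConverseCommand({
      modelId: "anthropic.claude-haiku-4-5-20251001-v1:0", // assumed Bedrock model ID
      system: [
        {
          text:
            "You are the founder of Botline writing a short, warm WhatsApp " +
            "message to a tenant whose account has a problem. Write in the " +
            "tenant's preferred language. Address the specific issue and " +
            "give 2-3 concrete next steps. No marketing tone.",
        },
      ],
      messages: [
        {
          role: "user",
          content: [
            {
              text: JSON.stringify({
                signal_type: signal.signalType,
                evidence: signal.evidence,
                preferred_language: signal.preferredLanguage,
              }),
            },
          ],
        },
      ],
      inferenceConfig: { maxTokens: 300, temperature: 0.7 },
    })
  );
  const text = res.output?.message?.content?.[0]?.text;
  if (!text) throw new Error("empty draft from model");
  return text;
}
```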
The language is auto-detected once per tenant lifetime. On first detection, the language detector pulls up to 50 messages where role='tenant' — the messages the tenant has typed themselves, which is the most reliable signal. (We fall back to inbound customer messages if no tenant typing has happened yet.) Those samples go to Haiku 4.5 with a system prompt that asks for one of roman_urdu, english, urdu, arabic, bahasa, hindi, mixed. The result is cached on a new tenants.preferred_language column and reused forever after.
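In code, the detect-once/cache-forever flow is roughly this. Column names follow the description above; the role values and `classifyLanguage` (the Haiku 4.5 call) are assumed helpers:

```typescript
// language-detector.ts — sketch of detect-once, cache-forever.
import type { Db } from "./db";
// assumed helper: one Haiku 4.5 call constrained to return one of
// roman_urdu | english | urdu | arabic | bahasa | hindi | mixed
import { classifyLanguage } from "./classify-language";

export async function preferredLanguage(db: Db, tenantId: string): Promise<string> {
  // Cached on tenants.preferred_language after the first detection.
  const cached = await db.query(
    `SELECT preferred_language FROM tenants WHERE id = $1`, [tenantId]
  );
  if (cached.rows[0]?.preferred_language) return cached.rows[0].preferred_language;

  // Prefer what the tenant typed themselves; fall back to inbound
  // customer messages if the tenant hasn't typed anything yet.
  let samples = await db.query(
    `SELECT content FROM messages
     WHERE tenant_id = $1 AND role = 'tenant'
     ORDER BY created_at DESC LIMIT 50`, [tenantId]
  );
  if (samples.rows.length === 0) {
    samples = await db.query(
      `SELECT content FROM messages
       WHERE tenant_id = $1 AND role = 'customer'
       ORDER BY created_at DESC LIMIT 50`, [tenantId]
    );
  }

  const lang = await classifyLanguage(samples.rows.map((r) => r.content));
  await db.query(
    `UPDATE tenants SET preferred_language = $1 WHERE id = $2`, [lang, tenantId]
  );
  return lang;
}
```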
Cost: roughly $0.0004 per tenant lifetime for language detection (~500 input + 5 output tokens), and roughly $0.0008 per draft (~600 input + 80 output tokens). At our current platform scale that's about $0.72/month total for the AI portion of this feature. We could afford to rerun the language detector on every draft and still be at coffee money.
The multi-admin queue UX
The queue at /admin/customer-success is a shared inbox across every admin in our workspace. Three product details mattered enough to call out (the approve path is sketched after the list):
- Optimistic locking on approve. When an admin clicks Approve & Send, we open a transaction, SELECT … FOR UPDATE the signal row, and verify status is still 'pending'. If two admins race the same draft, the second one sees a toast: “Already sent by Zeeshan 3 seconds ago.” No double-fires.
- Soft-edit indicator on textareas. When admin A opens the Edit textarea, a 5-minute lease is written to locked_by_user_id + locked_until with a client-side heartbeat. Other admins see “Zaheer is editing…” on that card and their textarea is read-only until the lease expires or A clicks away. This is a coordination hint, not a hard lock — the optimistic lock at approve time is the actual safety.
- Self-resolve detection at approve time. Between detection (cron tick) and approve (admin click) the underlying condition might have fixed itself — the tenant noticed the disconnect and reconnected on their own. We re-evaluate the evidence inside the approve transaction. If the condition no longer holds, we set status to 'self_resolved' and skip the send. Stops us spamming “hey we noticed you fixed it” messages, which would be the worst possible kind of customer-success outreach.
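Putting the first and third of those together, the approve path looks roughly like this. `db.tx`, the helper functions, and the audit columns (approved_by, sent_at) are assumptions; the SELECT … FOR UPDATE pattern is the point:

```typescript
// Sketch of the approve transaction. `conditionStillHolds` re-runs the
// detector's check for one tenant; `sendViaPlatformSupport` is our existing
// send pipeline; `withinRateLimits` is the shared check sketched below.
import type { Db } from "./db";
import { conditionStillHolds } from "./detectors/recheck";       // assumed helper
import { withinRateLimits } from "./rate-limits";                // assumed helper
import { sendViaPlatformSupport } from "./platform-support";     // assumed helper

export async function approveAndSend(db: Db, signalId: string, adminId: string) {
  return db.tx(async (trx) => {
    // Optimistic lock: grab the row and verify nobody beat us to it.
    const { rows } = await trx.query(
      `SELECT * FROM tenant_health_signals WHERE id = $1 FOR UPDATE`,
      [signalId]
    );
    const signal = rows[0];
    if (signal.status !== "pending") {
      // Second admin in the race: surfaced as "Already sent by X".
      return { ok: false, reason: "already_handled", by: signal.approved_by };
    }

    // Self-resolve check: if the tenant already fixed it, skip the send.
    if (!(await conditionStillHolds(trx, signal))) {
      await trx.query(
        `UPDATE tenant_health_signals SET status = 'self_resolved' WHERE id = $1`,
        [signalId]
      );
      return { ok: false, reason: "self_resolved" };
    }

    // Approve-time rate limit — the safety net behind the detection-time check.
    if (!(await withinRateLimits(trx, signal.tenant_id, signal.signal_type))) {
      return { ok: false, reason: "rate_limited" }; // surfaced as 409 Conflict
    }

    await sendViaPlatformSupport(signal.tenant_id, signal.draft_text);
    await trx.query(
      `UPDATE tenant_health_signals
       SET status = 'sent', approved_by = $2, sent_at = now()
       WHERE id = $1`,
      [signalId, adminId]
    );
    return { ok: true };
  });
}
```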
Every action is audited: who detected (the cron, with timestamp), who edited (with the original AI draft text vs the final edited text both retained), who approved, who dismissed, and the tenant's first response timestamp once they reply. The History tab reads from this and surfaces an effectiveness summary — sent / dismissed / self-resolved counts, response rate within 24h, reconnect-after-nudge rate for the disconnect signal.
Rate-limit guardrails and the “please don't spam me” rule
Customer-success automation that doesn't respect a customer's inbox is just noise. We enforce two layers of rate limiting:
- Per-tenant per-signal-type: max 1 outreach per signal per 14 days (configurable in tenant_health_settings.rate_limit_days). If we already nudged you about a disconnect 5 days ago, we won't nudge you again about a disconnect, full stop — even if the new disconnect is unrelated to the old one. We'd rather miss a real signal than burn the channel.
- Per-tenant cross-signal: max 2 outreaches total per 7 days. Hardcoded. This catches the case where a tenant has multiple unrelated problems firing simultaneously (disconnect + provider mismatch + silent week) — we send one message about the most pressing issue, not three.
Both checks happen at detection (the detector skips creating draft rows that would breach the limit) and at approve (in case settings changed mid-flight). The approve check returns 409 Conflict and the admin sees a toast explaining which limit was hit.
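The shared check both call sites use, sketched with assumed column names (sent_at, a 'sent' status, and a 14-day default when a tenant has no settings row):

```typescript
// rate-limits.ts — shared by the detectors and the approve endpoint.
import type { Db } from "./db";

export async function withinRateLimits(
  db: Db,
  tenantId: string,
  signalType: string
): Promise<boolean> {
  // Layer 1: max 1 outreach per signal type per rate_limit_days (default 14).
  const perSignal = await db.query(
    `SELECT 1
     FROM tenant_health_signals s
     LEFT JOIN tenant_health_settings cfg ON cfg.tenant_id = s.tenant_id
     WHERE s.tenant_id = $1
       AND s.signal_type = $2
       AND s.status = 'sent'
       AND s.sent_at > now() - make_interval(days => COALESCE(cfg.rate_limit_days, 14))
     LIMIT 1`,
    [tenantId, signalType]
  );
  if (perSignal.rows.length > 0) return false;

  // Layer 2, hardcoded: max 2 outreaches across all signal types per 7 days.
  const crossSignal = await db.query(
    `SELECT count(*) AS n
     FROM tenant_health_signals
     WHERE tenant_id = $1
       AND status = 'sent'
       AND sent_at > now() - interval '7 days'`,
    [tenantId]
  );
  return Number(crossSignal.rows[0].n) < 2;
}
```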
Snoozing is a per-card explicit override: an admin can hide a draft for 7 days if they have context (“tenant told me they're on holiday”), and it comes back automatically when the timer expires.
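Snooze is just a timestamp the queue query respects — sketched here with an assumed snoozed_until column:

```typescript
// Snooze a card for 7 days; the column names are assumptions.
await db.query(
  `UPDATE tenant_health_signals
   SET snoozed_until = now() + interval '7 days', snoozed_by = $2
   WHERE id = $1 AND status = 'pending'`,
  [signalId, adminId]
);

// Queue listing: pending and not currently snoozed, so a card reappears
// automatically once the timer expires.
const { rows } = await db.query(
  `SELECT * FROM tenant_health_signals
   WHERE status = 'pending'
     AND (snoozed_until IS NULL OR snoozed_until < now())
   ORDER BY created_at DESC`
);
```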
What we'd do next
This is admin-internal v1. We have three obvious extensions queued behind the false-positive measurement we'll do over the next month:
- Auto-send tier for high-confidence signals. Every day-1 signal still requires a human approval click. Once we have a measured false-positive rate per signal type, the lowest-FP categories — probably never_connected and provider_mismatch — can flip to auto-send with admin notification rather than admin approval. Disconnect and silent-after-activity stay human-in-the-loop because the false-positive cost is higher (vacation, intentional pause).
- More signal types. Booking-page funnel (“traffic but no bookings in 7d”), Instagram token expiry (“token expires in 7d”), and subscription-churn-risk are the next three. Each is a single SQL query against state we already have; the gating factor is that we want to ship them after the queue UX is proven on the day-1 signals to avoid drowning the queue.
- Productise it for tenants. The whole engine — detector framework, draft generator, multilingual prompt, multi-admin queue — can run against a tenant's own customer base instead of our tenant base. Tenants would get to define their own signals (“customer hasn't ordered in 30 days”, “cart abandoned twice”), and the AI would auto-draft re-engagement WhatsApps in the customer's preferred language. Customer-success automation as a feature rather than as a category — charged at the price of a normal Botline plan instead of the $1k–5k/month a dedicated customer-success platform costs.
The bigger lesson, which we keep relearning: most SaaS customer-success problems don't need ML or behavioural prediction. They need a SQL query, a small AI for the draft text, and an admin willing to click Approve. The expensive part is the willingness, not the infrastructure.