A few quarters back, a project landed on my desk scoped in three words: reduce involuntary churn. No spec, no acceptance criteria, no preferred approach. Just the outcome and the assumption that I'd figure out the rest. I loved it. What follows is what I figured out.
Most involuntary churn isn't a decision. Nobody opened your billing portal and rage-cancelled. A card expired, a bank declined for insufficient funds, an issuer's fraud heuristic got twitchy at 2 a.m. The customer still wants the product. The money just didn't move on the first try.
A dunning engine exists to close that gap: turn a soft decline into a successful capture before the customer ever notices, and do it without burning the goodwill of people who were never trying to leave. Yet a lot of teams treat dunning as a cron job that retries a charge every few days and fires off a sad email. That works the way a coin flip works.
Here's what actually separates an engine that recovers revenue from one that just annoys people, as that three-word project taught me: three things the engine cannot live without, and three that are genuinely nice but won't sink you if they're missing.
§01 · The non-negotiables
The load-bearing walls.
Skip one of these and you don't have a dunning engine, you have a retry loop with a mailing list.
A retry schedule built from decline reasons, not a calendar
The single most common mistake is retrying every failed payment on the same fixed cadence (day 1, day 3, day 7) regardless of why it failed. That's leaving money on the table, because the decline code is telling you exactly what to do.
An insufficient_funds decline is a timing problem. Retry it near payday, or a few days out, and a meaningful share recovers on its own. A card_expired decline will never recover on a retry. The card is dead, and hammering it just trains the issuer to treat you as abusive. A hard do_not_honor or fraudulent is a brick wall; stop. Soft network errors should retry within minutes, not days.
Your engine has to branch on the gateway's response and route each decline into the schedule that fits it. A single retry schedule for every decline is a sign nobody looked at why the payment actually failed.
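To make the branching concrete, here is a minimal sketch of a decline-aware router. Everything in it is an assumption for illustration: the `POLICY` table, the `Action` names, the `network_error` code, and the specific delays are hypothetical, and your gateway's actual decline codes may be spelled differently from the ones used in this article.

```python
from datetime import datetime, timedelta
from enum import Enum


class Action(Enum):
    RETRY = "retry"          # schedule another attempt
    ASK_CUSTOMER = "ask"     # only a new card can fix this
    STOP = "stop"            # hard decline or schedule exhausted


# Hypothetical policy table: decline code -> (action, delay before each retry).
# Delays are measured from the original failure and are illustrative, not tuned.
POLICY = {
    "insufficient_funds": (Action.RETRY, [timedelta(days=3), timedelta(days=7), timedelta(days=14)]),
    "network_error": (Action.RETRY, [timedelta(minutes=5), timedelta(minutes=30), timedelta(hours=2)]),
    "card_expired": (Action.ASK_CUSTOMER, []),
    "do_not_honor": (Action.STOP, []),
    "fraudulent": (Action.STOP, []),
}

# Unknown codes get one cautious retry (an assumption, not a gateway rule).
DEFAULT = (Action.RETRY, [timedelta(days=1)])


def next_step(decline_code, retries_so_far, failed_at):
    """Return (action, when) for a failed charge; `when` is None unless retrying."""
    action, delays = POLICY.get(decline_code, DEFAULT)
    if action is not Action.RETRY:
        return action, None
    if retries_so_far >= len(delays):
        return Action.STOP, None  # nothing left in this code's schedule
    return Action.RETRY, failed_at + delays[retries_so_far]
```

The point of the table is that each decline reason owns its own schedule, so adding a new code is a data change rather than another branch in the retry loop.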
A state machine that knows where every invoice is
The moment you have retries, notifications, partial recoveries, and customer-initiated payment updates all happening asynchronously, you are running a distributed workflow whether you meant to or not. And distributed workflows that aren't modeled as state machines turn into race conditions you debug at midnight.
You need an explicit, persisted state per invoice: failed → retrying → recovered, with terminal branches for voluntarily_paid, written_off, and canceled. Every transition has to be idempotent. A webhook will arrive twice. A retry will fire while a customer is mid-checkout updating their card. If two paths can both mark an invoice paid, one of them will eventually double-charge someone, and that someone will tell the internet about it. Model the states explicitly so the illegal transitions are simply impossible, not merely unlikely.
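A rough sketch of that idea, under some loud assumptions: the edges in `ALLOWED` are my guess at a reasonable transition table, `Invoice.apply` and `IllegalTransition` are hypothetical names, and the in-memory `seen_events` set stands in for what would be a persisted record written in the same transaction as the state change.

```python
from enum import Enum


class State(Enum):
    FAILED = "failed"
    RETRYING = "retrying"
    RECOVERED = "recovered"
    VOLUNTARILY_PAID = "voluntarily_paid"
    WRITTEN_OFF = "written_off"
    CANCELED = "canceled"


EXITS = {State.VOLUNTARILY_PAID, State.WRITTEN_OFF, State.CANCELED}

# Legal moves only. Terminal states map to an empty set, so nothing leaves them.
ALLOWED = {
    State.FAILED: {State.RETRYING} | EXITS,
    State.RETRYING: {State.RECOVERED} | EXITS,
    State.RECOVERED: set(),
    State.VOLUNTARILY_PAID: set(),
    State.WRITTEN_OFF: set(),
    State.CANCELED: set(),
}


class IllegalTransition(Exception):
    """Raised when a caller asks for a move the table does not allow."""


class Invoice:
    def __init__(self, invoice_id):
        self.invoice_id = invoice_id
        self.state = State.FAILED
        self.seen_events = set()  # IDs of events already applied

    def apply(self, event_id, target):
        """Apply a transition at most once per event_id; return True if state changed."""
        if event_id in self.seen_events:
            return False  # duplicate delivery (e.g. a repeated webhook): no-op
        if target not in ALLOWED[self.state]:
            raise IllegalTransition(f"{self.state.value} -> {target.value}")
        self.state = target
        self.seen_events.add(event_id)
        return True
```

With this shape, a second "paid" signal arriving after the invoice is already recovered fails loudly instead of quietly triggering another charge, and a replayed webhook does nothing at all.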
One refinement worth knowing once you go to build it: in production, these aren't all the same kind of state. The leading gateways keep two layers apart. retrying really lives at the payment-attempt layer, not the invoice layer. Most providers hold the invoice in a single “open” status and track retries on a separate attempt object and counter, surfacing “past due” or “retrying” only as a display badge. Writing off the debt, voiding the obligation, and revoking access are three separate switches rather than one, and “recovered” is usually derived from the attempt count on a paid invoice rather than stored as its own status. The flat list above is the right mental model; the schema underneath it has two layers, and matching that split makes your engine line up cleanly with how the gateways already think.
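Here is one way the two-layer split might look as data, strictly as a sketch. The class and field names (`InvoiceRecord`, `PaymentAttempt`, `uncollectible`, `voided`, `access_revoked`) are hypothetical and not any particular gateway's schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class PaymentAttempt:
    succeeded: bool
    decline_code: str = ""  # empty when the attempt succeeded


@dataclass
class InvoiceRecord:
    status: str = "open"          # invoice layer: "open" or "paid" in this sketch
    uncollectible: bool = False   # switch 1: debt written off
    voided: bool = False          # switch 2: obligation voided
    access_revoked: bool = False  # switch 3: product access turned off
    attempts: List[PaymentAttempt] = field(default_factory=list)

    def record(self, attempt):
        """Attempt layer: append the attempt; only a success changes the invoice."""
        self.attempts.append(attempt)
        if attempt.succeeded:
            self.status = "paid"

    @property
    def badge(self):
        """Display-only label, computed on read and never stored."""
        if self.status == "open" and self.attempts:
            return "retrying"
        return self.status

    @property
    def recovered(self):
        """Derived: a paid invoice that needed more than one attempt."""
        return self.status == "paid" and len(self.attempts) > 1
```

Note that "retrying" and "recovered" never appear as stored values here; both fall out of the attempt list, which is the split described above.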
A graceful exit: dunning that actually ends
Every dunning sequence needs a defined terminal state and an off-ramp, and a surprising number don't. What happens after the last retry fails? When does access get revoked, and is there a grace period? When the customer finally updates their card on day 9, does every queued retry and reminder stop immediately?
That last one is where engines quietly humiliate their owners. A customer pays, then keeps getting “your payment failed” emails for a week because the recovery path didn't cancel the dunning path. Now a recovered customer feels like they're fighting your billing system. The exit has to be as deliberately designed as the entry: clean revocation, clean reactivation, and a hard stop on all messaging the instant the money lands.
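A minimal sketch of that hard stop, assuming a hypothetical in-memory `DunningQueue`. In a real system the queue would be your job scheduler and `paid_invoices` would be a lookup against the invoice record, but the shape is the same: cancel on payment, and check again right before anything fires.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Job:
    kind: str          # "retry" or "reminder"
    invoice_id: str
    cancelled: bool = False


@dataclass
class DunningQueue:
    jobs: List[Job] = field(default_factory=list)

    def schedule(self, kind, invoice_id):
        job = Job(kind, invoice_id)
        self.jobs.append(job)
        return job

    def cancel_all(self, invoice_id):
        """Cancel every pending job for the invoice; return how many were cancelled."""
        count = 0
        for job in self.jobs:
            if job.invoice_id == invoice_id and not job.cancelled:
                job.cancelled = True
                count += 1
        return count


def on_payment_succeeded(queue, paid_invoices, invoice_id):
    """Record the payment and stop dunning in one step. Safe to call twice."""
    paid_invoices.add(invoice_id)
    return queue.cancel_all(invoice_id)


def should_run(job, paid_invoices):
    """Last-line guard a worker checks right before charging or sending."""
    return not job.cancelled and job.invoice_id not in paid_invoices
```

The second check in `should_run` is deliberate redundancy: if a job slips past cancellation because of a race, the worker still refuses to email a customer who has already paid.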
“A retry loop with no terminal state isn't persistence. It's a system that doesn't know when it has won.”
§02 · The bonuses
Nice to have, not load-bearing.
These are real advantages, and a mature engine should grow into all three. But you can ship and recover revenue without them on day one. So don't let them block the launch.
Smart retry timing from your own outcome data
Once the basics work, the next gain is timing retries against the patterns in your own recovery history rather than a hand-tuned schedule. Some cohorts recover best on weekday mornings; some issuers clear declines more readily at certain hours. Modeling that lifts recovery rates a few points.
It's a bonus because it's an optimization on top of a thing that already works. A sensible decline-aware schedule captures most of the available revenue; squeezing the last slice with learned timing is the kind of thing you do in quarter two, not week one.
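When you do get to it, the simplest version is not a model at all, just a tally. The sketch below assumes you log each retry as an (hour of day, succeeded) pair; `best_retry_hour`, `min_samples`, and `default_hour` are hypothetical names, and a real implementation would likely segment by cohort or issuer as well.

```python
from collections import defaultdict


def best_retry_hour(history, min_samples=20, default_hour=9):
    """Pick the hour of day (0-23) with the highest observed recovery rate.

    history: iterable of (hour, succeeded) pairs from past retries.
    Hours with fewer than min_samples attempts are ignored as too noisy.
    Falls back to default_hour when no hour has enough data.
    """
    tries = defaultdict(int)
    wins = defaultdict(int)
    for hour, succeeded in history:
        tries[hour] += 1
        if succeeded:
            wins[hour] += 1
    rates = {h: wins[h] / tries[h] for h in tries if tries[h] >= min_samples}
    if not rates:
        return default_hour
    # Iterating in sorted order makes ties resolve to the earliest hour.
    return max(sorted(rates), key=rates.get)
```

The sample-size floor matters more than the arithmetic: without it, an hour with two lucky recoveries beats an hour with two hundred solid ones.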
Network-level tools: account updater and network tokens
Card Account Updater (via Visa/Mastercard, surfaced through most gateways) automatically refreshes expired or reissued card numbers behind the scenes, and network tokens keep a credential alive across reissues. Both quietly prevent a chunk of failures from ever entering dunning. The best recovered payment is the one that never failed.
I list them as a bonus only because enrolling is an integration project that depends on your gateway and card mix, not core engine logic. When you turn them on, do it for the prevention; just know they sit beside your dunning system rather than inside it.
A notification ladder with real escalation
A good engine eventually pairs each retry with a customer message that escalates in tone and channel: a gentle heads-up first, a clearer warning as the terminal state nears, maybe SMS for high-value accounts, each with a one-click path to fix the card. Done well, this converts the customers who can self-serve before you ever exhaust retries.
It's gravy because the engine recovers money on silent retries alone; notifications add a lift, especially for expired cards where the customer is the only one who can fix it. Build the ladder once the core loop is solid, and respect the same rule as everywhere else: the instant they pay, the messaging stops.
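As a sketch of what "escalates in tone and channel" could mean in code: the tone names, the rule for when SMS kicks in, and the `ladder_step` function itself are all assumptions for illustration, not a recommended cadence.

```python
def ladder_step(attempt, max_attempts, high_value, paid):
    """Return (tone, channels) for the message paired with this retry, or None.

    attempt is 1-based. Returns None once the invoice is paid, so a caller
    cannot accidentally message a recovered customer.
    """
    if paid:
        return None  # the instant they pay, the messaging stops
    remaining = max_attempts - attempt
    if remaining <= 1:
        tone = "final_warning"   # the terminal state is close
    elif attempt >= 2:
        tone = "warning"
    else:
        tone = "heads_up"        # gentle first notice
    channels = ["email"]
    if high_value and tone != "heads_up":
        channels.append("sms")   # escalate channel only for high-value accounts
    return tone, channels
```

Each message would still carry the one-click link to fix the card; that part lives in the template rather than in this selection logic.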
There's a quieter lever inside the ladder, too. A dunning message doesn't have to be a billing chore. It can remind the customer what they've actually built on your platform. Pull the numbers your system already has and put them in front of them: invoices sent, funnels launched, revenue processed, members onboarded. A note that leads with “you've collected $48,000 through your storefront this year, update your card to keep it running” reframes the moment from “a bill I forgot” to “a thing I'd lose.” The failed card stops being an annoyance and starts looking like a threat to something they've invested in.
One condition makes or breaks this, and it's the whole reason it's a bonus and not a rule: only surface those numbers when they're genuinely good. Show a thriving account its own traction and you raise the cost of walking away. Show a barely-active account a thin stat line and you've handed them the exit. You've reminded a lukewarm customer they were never getting much value, right at the moment they're deciding whether to bother. Gate the usage-stats variant behind a real engagement threshold, and fall back to the plain “update your card” message for everyone below it. The progress story only works for the customers who actually have one.
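The gate itself is small. In this sketch the `Usage` fields and both thresholds are hypothetical placeholders; the right cutoffs depend entirely on what "thriving" looks like in your own product.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    revenue_processed: float
    invoices_sent: int


# Hypothetical engagement thresholds; tune these against your own accounts.
MIN_REVENUE = 5_000.0
MIN_INVOICES = 25

PLAIN_MESSAGE = "Your payment didn't go through. Update your card to keep your account active."


def dunning_message(usage):
    """Use the usage-stats variant only for accounts above both thresholds."""
    if usage.revenue_processed >= MIN_REVENUE and usage.invoices_sent >= MIN_INVOICES:
        return (
            f"You've collected ${usage.revenue_processed:,.0f} through your storefront "
            "this year. Update your card to keep it running."
        )
    return PLAIN_MESSAGE  # thin stats would only remind them how little they use it
```

Requiring both thresholds is a conservative choice: an account that clears one bar but not the other gets the plain message rather than a half-impressive stat line.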
The strategic takeaway.
The instinct is to start with the clever part: the learned timing, the multi-channel escalation, the network integrations. Resist it. The recovery comes from the unglamorous core: read the decline code and retry accordingly, model every invoice as an explicit state machine, and design an exit that ends cleanly the moment the money lands.
Get those three right and you have a system that quietly recovers a large share of failed payments while staying invisible to the people it's recovering from. The bonuses each add a few points on top. But a beautiful retry-timing model bolted to an engine with no defined terminal state isn't a marvel. It's just a more sophisticated way to double-charge someone and email them about it afterward.