Designing the Retry: Making LLM Calls Fail Like Grown-Ups

TL;DR

A try/except with a three-attempt loop is the wrong way to handle LLM failures, because LLM calls don't fail like normal API calls. They fail in at least five distinct shapes — malformed output, refusal, timeout, rate limit, and partial stream — and each one wants a different response: retry-with-repair, fallback model, degrade gracefully, or hard-fail. Retrying a refusal wastes money and loops. Retrying a malformed-JSON error without changing the prompt repeats the same mistake. The fix is a failure taxonomy, the right action per type, idempotency, jittered backoff, and a circuit breaker so a bad minute doesn't become a bankrupt afternoon.

May 2, 2026 · 7 min read
Tags: LLM, Reliability, Retry, Error Handling, AI Engineering

The first production incident I ever caused with an LLM was not a wrong answer. It was a retry loop.

I had wrapped a model call in the same defensive pattern I'd used for a decade on flaky HTTP services: try, catch, retry three times with backoff. Reasonable. Battle-tested. Completely wrong for LLMs. One afternoon a prompt change started producing JSON the model couldn't quite close, the parser threw, my loop dutifully retried the identical request three times, and every retry produced the identical broken JSON — because nothing about the request had changed. I paid for four failures where one would have done, multiplied across every request that hour.

That's the core insight: LLM calls don't fail like API calls, so they can't be retried like API calls. An HTTP 503 is the same failure every time and a retry is a reasonable bet that the server recovered. An LLM failure might be transient (rate limit), might be permanent until you change something (malformed output), or might be the model correctly refusing (a "no" you'll keep paying to hear).

The Failure Taxonomy

Before you can retry well, you have to name what broke. There are five shapes I plan for.

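┌──────────────────┬────────────────────────────────────┬────────────────────────────────────────────────────┐
│ FAILURE          │ WHAT IT LOOKS LIKE                 │ WHAT IT MEANS                                      │
├──────────────────┼────────────────────────────────────┼────────────────────────────────────────────────────┤
│ Malformed output │ JSON won't parse, schema violation │ Request was delivered, model answered wrong-shaped │
│ Refusal          │ "I can't help with that"           │ Model worked correctly; the answer is no           │
│ Timeout          │ No response in budget              │ Could be transient, could be a too-hard task       │
│ Rate limit       │ 429 from the provider              │ Purely transient, purely about pacing              │
│ Partial stream   │ Stream cuts mid-token              │ Connection died; you have half an answer           │
└──────────────────┴────────────────────────────────────┴────────────────────────────────────────────────────┘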

Lumping these into one except block is the original sin. They are five different problems and they want four different responses.

The Decision Table

This is the table I keep next to any retry code I write. The action depends entirely on the failure type — never on a blanket "retry N times."

┌────────────────┬──────────────────┬───────────────────────────────┐
│ FAILURE        │ RETRY?           │ CORRECT ACTION                │
├────────────────┼──────────────────┼───────────────────────────────┤
│ Malformed      │ Yes, but REPAIR  │ Re-prompt with the error +    │
│ output         │ first            │ the bad output. Don't resend  │
│                │                  │ the same request.             │
├────────────────┼──────────────────┼───────────────────────────────┤
│ Refusal        │ NO               │ Hard-fail or route to human.  │
│                │                  │ Retrying pays for the same no.│
├────────────────┼──────────────────┼───────────────────────────────┤
│ Timeout        │ Once, then       │ Retry same model once; if it  │
│                │ FALLBACK         │ repeats, degrade or fall back.│
├────────────────┼──────────────────┼───────────────────────────────┤
│ Rate limit     │ Yes, with        │ Backoff + jitter. Respect the │
│                │ BACKOFF          │ Retry-After header if given.  │
├────────────────┼──────────────────┼───────────────────────────────┤
│ Partial stream │ Yes, but         │ Resume if you can; else retry │
│                │ carefully        │ fresh. Never show the partial.│
└────────────────┴──────────────────┴───────────────────────────────┘
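
One way to keep the table honest in code is to make it a literal lookup. The sketch below is illustrative only: the Failure and Action names are hypothetical, not from any library, and detecting which failure you actually hit is left to your own client code.

```python
from enum import Enum, auto


class Failure(Enum):
    MALFORMED = auto()
    REFUSAL = auto()
    TIMEOUT = auto()
    RATE_LIMIT = auto()
    PARTIAL_STREAM = auto()


class Action(Enum):
    REPAIR = auto()               # re-prompt with the error and the bad output
    HARD_FAIL = auto()            # stop, or route to a human
    RETRY_THEN_FALLBACK = auto()  # one retry, then degrade or fall back
    BACKOFF = auto()              # exponential backoff with jitter
    RETRY_FRESH = auto()          # discard the partial, resume or start over


# One entry per row of the decision table above.
DECISION_TABLE = {
    Failure.MALFORMED: Action.REPAIR,
    Failure.REFUSAL: Action.HARD_FAIL,
    Failure.TIMEOUT: Action.RETRY_THEN_FALLBACK,
    Failure.RATE_LIMIT: Action.BACKOFF,
    Failure.PARTIAL_STREAM: Action.RETRY_FRESH,
}


def action_for(failure: Failure) -> Action:
    """Look up the action; an unmapped failure raises KeyError on purpose."""
    return DECISION_TABLE[failure]
```

Because the mapping is data, a new failure type with no entry fails loudly instead of silently falling into a generic retry.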

Malformed output → retry with repair

Do not resend the same request. Send a new request that includes the broken output and the parse error, and ask the model to fix it. This is the difference between "try the same thing again" and "here's what went wrong, correct it." The first repeats the mistake; the second usually fixes it on the first repair attempt.

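# parse, call_model, ParseError, original_messages, and completion are assumed
# to be defined by the surrounding application.
try:
    result = parse(completion)
except ParseError as e:
    # Show the model its own broken output and the parser error.
    repair = call_model([
        *original_messages,
        {"role": "assistant", "content": completion},
        {"role": "user", "content": f"That failed to parse: {e}. "
                                     "Return only valid JSON matching the schema."},
    ])
    result = parse(repair)  # one repair attempt; a second ParseError propagates as the hard-fail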

Refusal → do not retry

A refusal is the model working correctly. Retrying it is paying three times to be told "no" three times — and if your prompt or content tripped a safety boundary, it'll trip again. Route to a human, return a graceful message, or hard-fail. Just don't loop.

Refusals masquerade as malformed output

A refusal often arrives as text where your code expected JSON, so your parser throws and your malformed-output retry kicks in — now you're retrying a refusal. Detect refusals explicitly before the parse step, or you'll burn budget retrying a firm and final no.
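
A rough sketch of checking for a refusal before parsing might look like this. The phrase list and the RefusalError and parse_or_classify names are hypothetical; if your provider exposes a structured refusal signal, that is likely a better detector than string matching.

```python
import json

# Hypothetical phrase list; tune it against refusals you actually observe.
REFUSAL_MARKERS = ("i can't help", "i cannot help", "i'm unable to", "i won't be able to")


class RefusalError(Exception):
    """The model declined; callers should not retry."""


def parse_or_classify(completion: str):
    """Return parsed JSON, or raise RefusalError before the parser ever runs."""
    text = completion.strip()
    lowered = text.lower().replace("\u2019", "'")  # normalize curly apostrophes
    # Only prose can be a refusal here: JSON output starts with { or [.
    if not text.startswith(("{", "[")) and any(m in lowered for m in REFUSAL_MARKERS):
        raise RefusalError(text[:200])
    return json.loads(text)  # raises json.JSONDecodeError if malformed
```

The ordering is the point: RefusalError goes to the hard-fail path, and only a JSONDecodeError is allowed to reach the repair path.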

Timeout → retry once, then fall back

A timeout is ambiguous: maybe the network hiccuped, maybe the task is genuinely too hard for the current model and budget. Retry the same model exactly once. If it times out again, that's signal — fall back to a faster/smaller path or degrade gracefully ("I couldn't complete that, here's what I have").
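
As a sketch, "retry once, then fall back" fits in a few lines. This assumes your client raises Python's built-in TimeoutError when the budget is exceeded; primary and fallback are hypothetical zero-argument callables wrapping your real model calls.

```python
from typing import Callable

DEGRADED_MESSAGE = "I couldn't complete that, here's what I have."


def call_with_fallback(primary: Callable[[], str], fallback: Callable[[], str]) -> str:
    """Try primary twice (first attempt plus one retry), then fallback, then degrade."""
    for _ in range(2):
        try:
            return primary()
        except TimeoutError:
            continue  # the first timeout may be a hiccup; the second is signal
    try:
        return fallback()
    except TimeoutError:
        return DEGRADED_MESSAGE
```

Only timeouts are caught here; any other exception propagates so it can be classified on its own terms.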

Rate limit → backoff with jitter

The only failure that behaves like a classic transient error. Back off exponentially, and always add jitter — without it, every client that got rate-limited at the same instant retries at the same instant and stampedes the provider again. Respect the Retry-After header when the provider sends one.

import random  # base and cap are in seconds; attempt is the 0-based retry count
delay = min(base * (2 ** attempt), cap)
delay += random.uniform(0, delay * 0.3)   # jitter prevents stampede
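
Folding Retry-After into the snippet above could look like the sketch below. It assumes you have already parsed the header into seconds (the header can also carry an HTTP date, which this does not handle), and the default base and cap values are placeholders rather than recommendations.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 30.0) -> float:
    """Seconds to sleep before retry number `attempt` (0-based)."""
    if retry_after is not None:
        # The provider said how long to wait; never wait less than that.
        # Up to one second of jitter keeps clients from returning in lockstep.
        return retry_after + random.uniform(0, 1.0)
    delay = min(base * (2 ** attempt), cap)
    return delay + random.uniform(0, delay * 0.3)
```

A caller would sleep for backoff_delay(attempt, retry_after) between attempts and give up after a fixed number of tries.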

Partial stream → never show the half-answer

A stream that dies mid-token has given you a fragment that may be coherent enough to look complete and wrong enough to be dangerous. Discard it. Retry fresh, or resume from a checkpoint if your provider supports it — but never render the partial to a user.
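
One way to enforce this is to buffer the stream and release text only after a clean finish signal. In this sketch the (text, finish_reason) chunk shape is a hypothetical simplification; real SDKs shape their stream events differently, so adapt the unpacking to yours.

```python
from typing import Iterable, Optional, Tuple

Chunk = Tuple[str, Optional[str]]  # (text delta, finish reason or None)


class PartialStreamError(Exception):
    """The stream ended without a finish signal."""


def collect_stream(chunks: Iterable[Chunk]) -> str:
    """Buffer the whole stream; return text only if it finished cleanly."""
    parts = []
    try:
        for text, finish_reason in chunks:
            parts.append(text)
            if finish_reason is not None:
                return "".join(parts)
    except ConnectionError as exc:
        # The fragment in `parts` is dropped on purpose.
        raise PartialStreamError("connection dropped mid-stream") from exc
    raise PartialStreamError("stream ended without a finish reason")
```

Buffering gives up token-by-token display, which is usually an acceptable trade when the output feeds a parser or triggers an action.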

The Two Things That Save You From Yourself

Idempotency

If a write happens on the back of an LLM call, every retry risks doing it twice. Attach an idempotency key derived from the request so a retried call that already succeeded server-side doesn't fire the action again. This is the difference between a retry that's safe and a retry that double-charges a customer.
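
A minimal sketch of deriving a key from the request follows. The field names are hypothetical, and the in-memory dictionary is only a stand-in for a durable store: it is not safe across processes or concurrent workers.

```python
import hashlib
import json


def idempotency_key(user_id: str, action: str, payload: dict) -> str:
    """Same logical request gives the same key, however many times it is retried."""
    canonical = json.dumps(
        {"user": user_id, "action": action, "payload": payload},
        sort_keys=True, separators=(",", ":"),  # stable across dict ordering
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


_completed = {}  # stand-in for a durable store keyed by idempotency key


def run_once(key: str, side_effect):
    """Run side_effect at most once per key; replays return the stored result."""
    if key not in _completed:
        _completed[key] = side_effect()
    return _completed[key]
```

Deriving the key from the request rather than generating a random one per attempt is what makes a retry recognizable as a retry.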

A circuit breaker

This is the one that would have turned my old incident from a catastrophe into a blip. After N failures in a short window, stop calling the model entirely and fail fast for a cooldown period. When the cooldown ends, let a single test call through before resuming normal traffic.

        failures < N        ┌──────────┐
   ─────────────────────────►  CLOSED   │ calls flow normally
                             └────┬─────┘
                  N failures fast │
                             ┌────▼─────┐
                             │   OPEN    │ reject immediately,
                             └────┬─────┘  no model calls, save $$$
                  cooldown elapsed│
                             ┌────▼─────┐
                             │ HALF-OPEN │ let one test call through
                             └──────────┘
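
The states in the diagram can be sketched as a small class. The thresholds below are placeholders, and the exits from HALF-OPEN (a successful test call closes the circuit, a failed one reopens it) follow the usual circuit-breaker pattern rather than anything the diagram spells out. This version is single-threaded; concurrent callers would need a lock.

```python
import time


class CircuitOpenError(Exception):
    """Raised instead of calling the model while the breaker is open."""


class CircuitBreaker:
    def __init__(self, max_failures=5, window=60.0, cooldown=30.0, clock=time.monotonic):
        self.max_failures, self.window, self.cooldown = max_failures, window, cooldown
        self.clock = clock
        self.failures = []      # timestamps of recent failures
        self.opened_at = None   # None means CLOSED

    def call(self, fn):
        now = self.clock()
        if self.opened_at is not None and now - self.opened_at < self.cooldown:
            raise CircuitOpenError("failing fast; no model call made")
        # Either CLOSED, or the cooldown elapsed and this is the HALF-OPEN test call.
        try:
            result = fn()
        except Exception:
            self._record_failure(now)
            raise
        self.failures.clear()
        self.opened_at = None   # success closes the circuit
        return result

    def _record_failure(self, now):
        if self.opened_at is not None:
            self.opened_at = now  # test call failed: reopen for another cooldown
            return
        self.failures = [t for t in self.failures if now - t < self.window] + [now]
        if len(self.failures) >= self.max_failures:
            self.opened_at = now
```

Wrapping every model call as breaker.call(lambda: call_model(messages)) means the open state costs nothing: the lambda is never invoked.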

A naive retry loop is a billing weapon

A retry loop with no circuit breaker, wrapped around a streaming or agentic call, can re-issue the same expensive request hundreds of times before a human notices. I've watched it happen. The circuit breaker isn't optional infrastructure — it's the seatbelt that keeps a bad minute from becoming a bankrupt afternoon. Pair it with the cost-monitoring pulse so the spike shows up on a dashboard the same day.

Failing Like a Grown-Up

Mature error handling for LLMs isn't about preventing failure — these systems fail constantly and that's fine. It's about failing in a way that's cheap, bounded, and explainable. Name the failure. Pick the action from the table. Repair instead of repeat. Back off with jitter. Stay idempotent. And put a circuit breaker between your good intentions and your invoice.

Do that, and your LLM features fail like grown-ups: quietly, safely, and without taking the rest of the system — or the budget — down with them.

Osvaldo Restrepo

Senior Full Stack AI & Software Engineer. Building production AI systems that solve real problems.