Multi-step systems fail in interesting ways at the seams. The discipline of placing seams well is what separates a Friday-night-demo from a Monday-morning-running system.
Idempotency: the most important word
Every step that touches the outside world should be safe to retry. If your agent emits send_email() and that step fails halfway, retrying should not send a duplicate email. The usual trick: dedupe key derived from the input.
Where state lives
- In the message history — works for ephemeral state during a single loop. Loses it on restart.
- In a database row keyed by run_id — works for state that needs to outlive the loop.
- In the external system — when the source of truth is the CRM or the ticket, prefer reading it back vs. caching what you wrote.
Retries with intent
Default retry policy: 3 attempts with exponential backoff for transient failures (network, rate limit, 5xx). Do not retry 4xx — they're your fault and won't fix themselves. Always log the attempt count alongside the run.
Knowledge check
0/1 answered1. An agent step emits create_invoice(). The step fails midway. The safe retry policy requires that:
Discussion
0 commentsBe the first to start the conversation.