Reliability
Built-in fault tolerance for agent execution
The Dispatch system includes built-in reliability features that help agents complete their work even when individual steps fail.
Why Reliability Matters
Agents can fail for many reasons:
- API rate limits
- Network issues
- Model errors
- Timeouts
Without reliability features, a task that hits one of these failures is lost.
Automatic Retries
Failed tasks are automatically retried with exponential backoff:
- First retry: 1 second delay
- Second retry: 2 second delay
- Third retry: 4 second delay
- Each further retry doubles the previous delay
This avoids overwhelming a service that is already struggling and gives transient failures time to clear.
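The retry schedule can be sketched in a few lines of Python. This is a minimal illustration, not Dispatch's actual implementation: `backoff_delays`, `run_with_retries`, and their parameters are hypothetical names, and the doubling factor is inferred from the 1, 2, 4 second sequence.

```python
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def backoff_delays(max_retries: int, base_delay: float = 1.0, factor: float = 2.0) -> List[float]:
    """Return the delay before each retry: base_delay, base_delay * factor, ..."""
    return [base_delay * factor ** attempt for attempt in range(max_retries)]


def run_with_retries(
    task: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Run task once, then retry up to max_retries times with growing delays."""
    delays = backoff_delays(max_retries, base_delay)
    for attempt in range(max_retries + 1):
        try:
            return task()
        except Exception:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(delays[attempt])
    raise AssertionError("unreachable")
```

Production systems often add random jitter to each delay so that many failed tasks do not retry at the same instant; that is left out here to keep the schedule easy to read.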
State Persistence
Workflow state is persisted across:
- Server restarts
- Deployments
- Failures
A workflow can pause, the server can restart, and the workflow picks up where it left off.
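One common way to get this behavior is to checkpoint after every completed step and skip completed steps on restart. The sketch below assumes a JSON file as the store; `load_state`, `save_state`, and `run_workflow` are hypothetical names, and Dispatch's real storage layer is not described in this document.

```python
import json
import os
import tempfile
from typing import Any, Callable, Dict, List, Tuple

Step = Callable[[Dict[str, Any]], Any]


def load_state(path: str) -> Dict[str, Any]:
    """Load the last checkpoint, or start fresh if none exists."""
    if not os.path.exists(path):
        return {"completed": [], "data": {}}
    with open(path) as f:
        return json.load(f)


def save_state(path: str, state: Dict[str, Any]) -> None:
    """Write to a temp file, then swap it in, so a crash never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, path)


def run_workflow(path: str, steps: List[Tuple[str, Step]]) -> Dict[str, Any]:
    """Run steps in order, skipping any that a previous run already completed."""
    state = load_state(path)
    for name, step in steps:
        if name in state["completed"]:
            continue  # finished before the restart
        state["data"][name] = step(state["data"])
        state["completed"].append(name)
        save_state(path, state)  # checkpoint after every step
    return state
```

If the process dies partway through, calling `run_workflow` again with the same path resumes at the first step that has no checkpoint.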
Idempotent Execution
Steps are designed to be safely re-run:
- Same input → Same output
- No duplicate side effects
- Safe to retry
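A typical way to achieve these properties is an idempotency key: derive a key from the step name and its input, and reuse the stored result when the same key is seen again. The following is a minimal in-memory sketch under that assumption; `IdempotentRunner` and its methods are hypothetical and not part of any documented Dispatch API.

```python
import hashlib
import json
from typing import Any, Callable, Dict


class IdempotentRunner:
    """Runs a step at most once per (step name, input) pair."""

    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}  # a real system would use durable storage

    @staticmethod
    def key_for(step_name: str, payload: Dict[str, Any]) -> str:
        # sort_keys makes the key independent of dict ordering
        canonical = json.dumps(payload, sort_keys=True)
        return hashlib.sha256(f"{step_name}:{canonical}".encode("utf-8")).hexdigest()

    def run(self, step_name: str, payload: Dict[str, Any], step: Callable[[Dict[str, Any]], Any]) -> Any:
        key = self.key_for(step_name, payload)
        if key not in self._results:
            # first time this input is seen: perform the side effect once
            self._results[key] = step(payload)
        return self._results[key]  # a retry returns the stored result
```

With this in place, retrying a step that sends an email returns the recorded result instead of sending a second message.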
Error Handling
When retries are exhausted:
- Fallback steps can execute
- Notifications are sent
- Tasks are marked failed
- Humans can intervene
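The outcomes above could be wired together roughly as follows. The order shown here (notify, then try a fallback, then mark the task failed for human review) is an assumption, as are the names `TaskResult` and `finish_task`; backoff delays between attempts are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TaskResult:
    status: str  # "succeeded", "fallback", or "failed"
    value: Any = None
    error: Optional[str] = None


def finish_task(
    task: Callable[[], Any],
    max_attempts: int,
    fallback: Optional[Callable[[], Any]] = None,
    notify: Callable[[str], None] = print,
) -> TaskResult:
    """Try the task; once attempts are exhausted, notify, try the fallback, else mark failed."""
    last_error: Optional[Exception] = None
    for _ in range(max_attempts):
        try:
            return TaskResult("succeeded", task())
        except Exception as exc:
            last_error = exc

    notify(f"task failed after {max_attempts} attempts: {last_error}")

    if fallback is not None:
        try:
            return TaskResult("fallback", fallback(), error=str(last_error))
        except Exception as exc:
            last_error = exc

    # Nothing worked: record the failure so a human can inspect and intervene.
    return TaskResult("failed", error=str(last_error))
```

Returning a result object instead of raising keeps the failure visible in the task record, which is what lets a person pick it up later.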
Monitoring
Track execution health:
- Success/failure rates
- Average execution time
- Retry frequency
- Error patterns
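As an illustration of how these four metrics might be computed from raw execution records, here is a small sketch. The `Execution` record and `summarize` function are hypothetical; the fields Dispatch actually exposes are not specified in this document.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass
class Execution:
    succeeded: bool
    duration_seconds: float
    retries: int
    error_type: Optional[str] = None  # e.g. "RateLimit", "Timeout"


def summarize(executions: List[Execution]) -> Dict[str, Any]:
    """Aggregate execution records into the health metrics listed above."""
    total = len(executions)
    if total == 0:
        return {"success_rate": None, "failure_rate": None,
                "avg_duration_seconds": None, "retries_per_execution": None,
                "error_counts": {}}
    successes = sum(1 for e in executions if e.succeeded)
    return {
        "success_rate": successes / total,
        "failure_rate": (total - successes) / total,
        "avg_duration_seconds": sum(e.duration_seconds for e in executions) / total,
        "retries_per_execution": sum(e.retries for e in executions) / total,
        # counting errors by type surfaces recurring patterns
        "error_counts": dict(Counter(e.error_type for e in executions if e.error_type)),
    }
```

A rising `retries_per_execution` with a stable success rate usually means retries are masking a degrading dependency, so it is worth watching alongside the failure rate.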