Tool calling in production: what nobody tells you
Most AI demos skip the hard part. Here's the gap between a fun prototype and an agent your users can actually rely on.
Tool calling is the single most important primitive for building useful AI products. It’s also the one whose rough edges most demos hide.
I’ve spent the last few months wiring agents into real product flows — payments, scheduling, knowledge retrieval. Here are the four things I wish someone had told me on day one.
1. The model will call tools you didn’t define
Even with strict JSON schemas, a model that’s been fine-tuned on a thousand other tool surfaces will occasionally hallucinate a tool that isn’t one of yours. Defensive parsing is non-negotiable.
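    // Reject any tool name the model invented before it reaches dispatch.
    function safeParseTool(call: ToolCall) {
      if (!KNOWN_TOOLS.has(call.name)) {
        return { ok: false, error: "unknown_tool" } as const;
      }
      // `as const` on both branches lets callers narrow on `ok`.
      return { ok: true, payload: call } as const;
    }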
If you don’t validate, your agent will eventually try to call search_emails on a system that has never had email access. Then it hangs, or worse, it confabulates a result.
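One way to avoid the hang is to turn the failure into an ordinary tool result, so the model sees the error and can recover. The sketch below is a minimal illustration, not any SDK's actual API: the `ToolCall` and `ToolResult` shapes, the `HANDLERS` registry, and the `get_weather` tool are all hypothetical.

```typescript
// Hypothetical shapes; real SDKs name and structure these differently.
type ToolCall = { id: string; name: string; args: unknown };
type ToolResult = { callId: string; ok: boolean; content: string };
type ToolHandler = (args: unknown) => string;

// Illustrative registry: the only tools this agent actually has.
const HANDLERS = new Map<string, ToolHandler>([
  ["get_weather", () => "sunny"],
]);

function dispatch(call: ToolCall): ToolResult {
  const handler = HANDLERS.get(call.name);
  if (!handler) {
    // Report the failure as a tool result, listing the tools that do exist,
    // so the model can correct itself instead of waiting on nothing.
    const known = Array.from(HANDLERS.keys()).join(", ");
    return {
      callId: call.id,
      ok: false,
      content: `unknown_tool: "${call.name}". Available tools: ${known}`,
    };
  }
  try {
    return { callId: call.id, ok: true, content: handler(call.args) };
  } catch (err) {
    // A handler that throws is also reported back rather than swallowed.
    return { callId: call.id, ok: false, content: `tool_error: ${String(err)}` };
  }
}
```

Every call gets exactly one result back, which also makes a confabulated result easier to spot in the trace.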
2. Latency compounds, fast
A simple chain (plan → call tool → reflect → respond) easily hits four LLM round-trips. At 800 ms each, that’s 3.2 seconds of model latency alone, before you add the tool’s own execution time and before the user sees anything. Two fixes have an outsized impact:
- Streaming the final response, even when the tool steps were sequential. The user starts reading immediately.
- Parallel tool execution when the model emits multiple tool calls in one turn. Most SDKs support this natively; many implementations don’t use it.
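The second fix can be sketched as follows, assuming the model's turn has already been parsed into a list of calls. The `ToolCall`, `ToolResult`, and `Executor` types are hypothetical stand-ins for whatever your SDK provides. With sequential execution the wait is the sum of the tool durations; run concurrently, it is roughly the slowest one.

```typescript
// Hypothetical shapes for illustration.
type ToolCall = { id: string; name: string; args: Record<string, unknown> };
type ToolResult = { callId: string; ok: boolean; content: string };
type Executor = (call: ToolCall) => Promise<string>;

// Run every call from one model turn concurrently. Results come back in the
// same order as the input, and one failing call does not reject the batch.
async function runToolCalls(
  calls: ToolCall[],
  execute: Executor,
): Promise<ToolResult[]> {
  return Promise.all(
    calls.map(async (call): Promise<ToolResult> => {
      try {
        return { callId: call.id, ok: true, content: await execute(call) };
      } catch (err) {
        return { callId: call.id, ok: false, content: String(err) };
      }
    }),
  );
}
```

Catching per call is a deliberate choice here: a bare `Promise.all` rejects as soon as any call fails, which would discard the results of the calls that succeeded.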
3. Idempotency matters more than you think
Agents retry. Networks fail. The model itself sometimes decides to “double-check” by calling the same tool twice. If your create_invoice is not idempotent, you’ll bill someone twice on a bad day.
Pass a deterministic idempotency_key derived from the conversation turn ID — not from the tool arguments, because the model will paraphrase them between attempts.
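A minimal sketch of that idea, under stated assumptions: the `InvoiceArgs` and `Invoice` types, the `inv_` ID format, and the in-memory `Map` (standing in for a durable store) are all hypothetical. The key is built from the turn ID and the tool name only, so a retry with reworded arguments still maps to the same key.

```typescript
// Hypothetical types; the Map stands in for a durable, shared store.
type InvoiceArgs = { customerId: string; amountCents: number; memo: string };
type Invoice = { invoiceId: string } & InvoiceArgs;

const completed = new Map<string, Invoice>();
let nextInvoiceNumber = 1;

// The key depends only on the turn and the tool, never on the arguments.
function idempotencyKey(turnId: string, toolName: string): string {
  return `${turnId}:${toolName}`;
}

function createInvoice(turnId: string, args: InvoiceArgs): Invoice {
  const key = idempotencyKey(turnId, "create_invoice");
  const previous = completed.get(key);
  if (previous) {
    // Retry or "double-check": return the stored result, no new side effect.
    return previous;
  }
  const invoice: Invoice = { invoiceId: `inv_${nextInvoiceNumber++}`, ...args };
  completed.set(key, invoice);
  return invoice;
}
```

Calling `createInvoice("turn_42", ...)` twice, even with a differently worded `memo`, yields the same invoice. The trade-off in this sketch is that a turn can create at most one invoice; if a single turn can legitimately create several, the key would need an additional stable component that does not come from the model's wording.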
4. Observability is the moat
The difference between an agent you can ship and an agent you can’t is whether you can answer “what did the model do at 14:32 yesterday and why?”
The minimum I now ship with every agent:
- Full trace per turn (prompt, tool calls, tool results, final output) stored for 30 days.
- Token cost + latency per turn, plotted over time.
- A “replay” endpoint that re-runs a stored turn against a different model.
That last one is what lets you upgrade models without praying.
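Here is one rough shape the trace and replay pieces could take. Everything in it is an assumption for illustration: the `TurnTrace` fields, the `ModelRunner` interface, and the in-memory `Map` standing in for a real trace store. It also assumes replay feeds the recorded tool results back to the new model instead of re-executing tools, so a replay never triggers side effects.

```typescript
// Hypothetical trace shape covering the fields listed above.
type ToolStep = { name: string; args: unknown; result: string };
type TurnTrace = {
  turnId: string;
  model: string;
  prompt: string;
  toolSteps: ToolStep[];
  output: string;
  tokens: number;
  latencyMs: number;
  recordedAt: string; // ISO 8601 timestamp
};

// Assumed interface: produce a final output from a model name, the original
// prompt, and the recorded tool steps.
type ModelRunner = (
  model: string,
  prompt: string,
  toolSteps: ToolStep[],
) => Promise<string>;

const traces = new Map<string, TurnTrace>(); // stand-in for a real trace store

function recordTrace(trace: TurnTrace): void {
  traces.set(trace.turnId, trace);
}

// Re-run a stored turn against another model using the recorded tool
// results, then report whether the final output changed.
async function replayTurn(turnId: string, model: string, run: ModelRunner) {
  const trace = traces.get(turnId);
  if (!trace) {
    throw new Error(`no trace stored for turn ${turnId}`);
  }
  const replayOutput = await run(model, trace.prompt, trace.toolSteps);
  return {
    turnId,
    originalModel: trace.model,
    replayModel: model,
    originalOutput: trace.output,
    replayOutput,
    changed: replayOutput !== trace.output,
  };
}
```

Running `replayTurn` over a sample of stored turns before a model upgrade gives a concrete list of turns whose output changed, which is a smaller thing to review than the whole agent.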
The difference between a demo that wows and a product that survives contact with users almost always lives here — not in the model choice, not in the prompt. In the defensive parsing, the idempotency keys, the traces.