Structured Outputs and Tool Calling: Making LLMs Reliable

I shipped a feature last year that classified support tickets into six categories and extracted a priority score. The first version asked the model for "a JSON object" in the prompt and parsed the response with JSON.parse. It worked in the demo. It worked for two weeks. Then a customer pasted a code snippet with a stray backtick into a ticket, the model wrapped its answer in a markdown fence to be "helpful," and my parser threw on the leading ```json. The pager went off at 2 a.m. for a string-handling bug that had nothing to do with the actual task.

That is the whole problem with free-text parsing: the failure mode is invisible until production traffic finds the one input you didn't imagine. The fix is to stop treating the model's output as text you hope is JSON, and start treating it as a typed value the platform is contractually obligated to produce. Both OpenAI and Anthropic have first-class support for this now, and the gap between "JSON mode" and "schema-enforced generation" is the gap between hoping and knowing.

Why free-text parsing is fragile

When you ask a model for JSON in the prompt, you are relying on its tendency to comply, not a guarantee. The model can still: wrap output in a markdown fence, emit a trailing comment, hallucinate an extra field, use a string "3" where you expected a number 3, truncate mid-object when it hits the token limit, or prepend "Sure! Here's the JSON:". Every one of those is a JSON.parse exception or a silently wrong value downstream.
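To make those failure modes concrete, here is a small sketch with no API calls. The sample strings are invented stand-ins for model replies, and `tryParse` is a hypothetical helper, not part of any SDK.

```typescript
// Build the markdown fence marker without writing it literally in the source.
const FENCE = "`".repeat(3);

// Hypothetical model replies (invented for illustration). All look like JSON to a human.
const replies = {
  fenced: FENCE + 'json\n{"priority": 3}\n' + FENCE,
  chatty: 'Sure! Here\'s the JSON: {"priority": 3}',
  wrongType: '{"priority": "3"}',
};

type ParseOutcome =
  | { ok: true; value: unknown }
  | { ok: false; error: string };

// Wrap JSON.parse so a failure becomes a value instead of an exception.
function tryParse(text: string): ParseOutcome {
  try {
    return { ok: true, value: JSON.parse(text) };
  } catch (e) {
    return { ok: false, error: e instanceof Error ? e.message : String(e) };
  }
}

const fenced = tryParse(replies.fenced);       // fails: leading backticks are not JSON
const chatty = tryParse(replies.chatty);       // fails: leading prose is not JSON
const wrongType = tryParse(replies.wrongType); // parses, but priority is the string "3"
```

The third case is the dangerous one: nothing throws, and the wrong type travels downstream.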

You can paper over some of this with regex extraction and lenient parsers. Don't. You'll spend more time maintaining a brittle parser than it would have cost to use the constrained-generation API in the first place.

Three levels of "structured"

There's a real hierarchy here, and people conflate the bottom two constantly.

Level	What it guarantees	API
Prompt-only JSON	Nothing. Model usually complies.	Any model, just ask
JSON mode	Output is syntactically valid JSON. Says nothing about shape.	OpenAI `response_format: { type: "json_object" }`
Schema-enforced	Output is valid JSON and matches your schema (right keys, right types, required fields present).	OpenAI Structured Outputs (`json_schema`, `strict: true`); Anthropic tool-use with `input_schema`

JSON mode solves the "is it parseable" problem and nothing else. You can still get {"category": "urgnet"} with a typo'd enum value. Schema-enforced generation is the one you want, because the provider constrains token sampling so the output cannot violate the schema. It's grammar-constrained decoding, not post-hoc validation: the model cannot emit a token that would invalidate the JSON against your schema. The two exits that remain are a refusal and a response truncated by the token limit, so check for both.
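The gap between the two levels is easy to see offline. The sketch below is not the provider's constrained decoder; it is a hand-written check, with `CATEGORIES` and `isCategory` as illustrative names, showing that syntactically valid JSON can still carry a value your types forbid.

```typescript
// Allowed categories for the six-category ticket example.
const CATEGORIES = ["billing", "bug", "feature_request", "account", "security", "other"] as const;
type Category = (typeof CATEGORIES)[number];

// Type guard: true only when the value is one of the allowed enum members.
function isCategory(value: unknown): value is Category {
  return typeof value === "string" && (CATEGORIES as readonly string[]).includes(value);
}

// What JSON mode can legally hand back: parseable, but the enum value is wrong.
const jsonModeReply = '{"category": "urgnet"}';
const parsedReply = JSON.parse(jsonModeReply) as { category?: unknown };

// JSON.parse succeeded above, yet the shape check fails.
const shapeIsValid = isCategory(parsedReply.category);
```

With JSON mode this check is your job after the fact. With schema enforcement the enum is applied while the output is being generated.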

OpenAI calls this Structured Outputs. Anthropic doesn't ship a separate response_format; the idiomatic path is to define a tool whose input_schema is your target shape and force the model to call it. Same outcome, different door.
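For the Anthropic route, a request body might look like the sketch below. It only builds the payload and reads a response-shaped object, so nothing here touches the network. The tool name `record_triage`, the model placeholder, and the trimmed local types are my assumptions; check field names against the current Messages API reference before relying on them.

```typescript
// Minimal local types for the response parts used here (trimmed, not the SDK's own types).
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type TextBlock = { type: "text"; text: string };
type ContentBlock = ToolUseBlock | TextBlock;

const TOOL_NAME = "record_triage"; // hypothetical tool name

// Request body that forces the model to answer by "calling" the tool.
function buildTriageRequest(ticket: string) {
  return {
    model: "YOUR_PINNED_CLAUDE_MODEL", // placeholder: substitute a pinned model ID
    max_tokens: 512,
    tools: [
      {
        name: TOOL_NAME,
        description: "Record the triage result for a support ticket.",
        input_schema: {
          type: "object",
          properties: {
            category: {
              type: "string",
              enum: ["billing", "bug", "feature_request", "account", "security", "other"],
            },
            priority: { type: "integer", minimum: 1, maximum: 5 },
          },
          required: ["category", "priority"],
        },
      },
    ],
    // Forced tool choice: the model must emit a tool_use block for this tool.
    tool_choice: { type: "tool", name: TOOL_NAME },
    messages: [{ role: "user", content: ticket }],
  };
}

// Pull the structured value out of the response's content blocks.
function extractToolInput(content: ContentBlock[]): unknown {
  const block = content.find(
    (b): b is ToolUseBlock => b.type === "tool_use" && b.name === TOOL_NAME,
  );
  if (!block) throw new Error("Model did not return the expected tool_use block");
  return block.input;
}
```

Treat the extracted input as unvalidated until your own schema check has run.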

Structured outputs with Zod and a retry loop

Here's the pattern I use in production with the OpenAI SDK. I define the schema once in Zod, derive the JSON Schema the API needs, and validate the parsed result against the same Zod schema as a belt-and-suspenders check. Even with strict: true, I keep the validation because schema enforcement covers structure, not semantics — it won't catch a logically impossible value your business rules forbid.

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";
 
const client = new OpenAI();
 
const TicketTriage = z.object({
  category: z.enum([
    "billing",
    "bug",
    "feature_request",
    "account",
    "security",
    "other",
  ]),
  priority: z.number().int().min(1).max(5),
  summary: z.string().max(200),
  needs_human: z.boolean(),
});
 
type TicketTriage = z.infer<typeof TicketTriage>;
 
async function triage(ticket: string, attempt = 1): Promise<TicketTriage> {
  const completion = await client.chat.completions.parse({
    model: "gpt-4.1-2025-04-14",
    temperature: 0,
    messages: [
      { role: "system", content: "Classify the support ticket. Be terse." },
      { role: "user", content: ticket },
    ],
    response_format: zodResponseFormat(TicketTriage, "ticket_triage"),
  });
 
  const message = completion.choices[0].message;

  // Check the refusal field first: the model can decline rather than emit junk,
  // and in that case there is no parsed value to read.
  if (message.refusal) {
    throw new Error(`Model refused: ${message.refusal}`);
  }

  const parsed = message.parsed;
 
  const result = TicketTriage.safeParse(parsed);
  if (!result.success) {
    if (attempt >= 3) {
      throw new Error(`Schema validation failed after 3 attempts: ${result.error.message}`);
    }
    return triage(ticket, attempt + 1);
  }
  return result.data;
}

zodResponseFormat does the Zod-to-JSON-Schema conversion and sets strict: true for you. With strict mode on, validation will essentially never fail on structure — but it can still fail if you tighten the Zod schema beyond what JSON Schema can express (cross-field invariants, .refine() rules). That's when the retry loop earns its keep. Three attempts is my default; past that, you have a prompt problem, not a transient one, and retrying just burns money.
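One refinement worth considering: at temperature 0, resending an identical request tends to reproduce the same answer, so a retry is more useful when it carries the validation error back to the model. The sketch below is provider-agnostic and uses hypothetical names throughout (`generateWithRetry`, `Generate`, `Validate`); the `generate` callback stands in for whatever API call you make.

```typescript
type Validation<T> = { ok: true; value: T } | { ok: false; error: string };

// generate: produces a raw candidate, optionally given feedback from the last failure.
type Generate = (feedback: string | null) => Promise<unknown>;
// validate: the business-rule check (for example a Zod safeParse mapped to this shape).
type Validate<T> = (candidate: unknown) => Validation<T>;

async function generateWithRetry<T>(
  generate: Generate,
  validate: Validate<T>,
  maxAttempts = 3,
): Promise<T> {
  let feedback: string | null = null;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = await generate(feedback);
    const result = validate(candidate);
    if (result.ok) return result.value;
    // Carry the failure reason into the next attempt instead of repeating the same request.
    feedback = `Previous answer was rejected: ${result.error}`;
  }
  throw new Error(`Validation failed after ${maxAttempts} attempts: ${feedback}`);
}
```

In practice the feedback string would be appended to the messages array as an extra user turn before the next call.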

One caveat people miss: pin the model version. gpt-4.1 floats; gpt-4.1-2025-04-14 does not. When a schema starts failing after a silent model rollout, you want a one-line diff to bisect, not a mystery.

Tool calling: letting the model act

Structured output gets data out of the model. Tool calling lets the model reach into your systems — query a database, hit an API, look up an order. The mechanics are the same schema machinery pointed in the other direction.

The loop is always four steps, regardless of provider:

1. You send the model a list of tools, each with a name, description, and JSON Schema for its arguments.
2. The model decides whether to call a tool and, if so, generates arguments that conform to that schema.
3. You execute the tool. The model never runs anything; it only emits a request.
4. You feed the result back as a message, and the model continues until it produces a final answer.

That third step is the one that trips people up. The model's "tool call" is just structured output describing an intention. Your code owns execution, which means your code owns the guardrails.

Here's a complete loop with the OpenAI SDK. Note that the tool argument schema is, again, a Zod-derived JSON Schema — the same discipline.

import OpenAI from "openai";
import { z } from "zod";
import { zodToJsonSchema } from "zod-to-json-schema";
 
const client = new OpenAI();
 
const GetOrderArgs = z.object({ orderId: z.string().regex(/^ord_[a-z0-9]{12}$/) });
 
const tools = [
  {
    type: "function" as const,
    function: {
      name: "get_order_status",
      description: "Look up the current status of an order by its ID.",
      parameters: zodToJsonSchema(GetOrderArgs),
      strict: true,
    },
  },
];
 
async function getOrderStatus(orderId: string) {
  // Real DB call lives here. Validate args before you trust them.
  const { orderId: id } = GetOrderArgs.parse({ orderId });
  return { orderId: id, status: "shipped", eta: "2026-06-19" };
}
 
async function runAgent(userMessage: string) {
  const messages: OpenAI.Chat.ChatCompletionMessageParam[] = [
    { role: "user", content: userMessage },
  ];
 
  for (let turn = 0; turn < 6; turn++) {
    const res = await client.chat.completions.create({
      model: "gpt-4.1-2025-04-14",
      temperature: 0,
      messages,
      tools,
    });
 
    const msg = res.choices[0].message;
    messages.push(msg);
 
    if (!msg.tool_calls?.length) return msg.content; // final answer
 
    for (const call of msg.tool_calls) {
      let output: string;
      try {
        const args = JSON.parse(call.function.arguments);
        if (call.function.name === "get_order_status") {
          output = JSON.stringify(await getOrderStatus(args.orderId));
        } else {
          output = JSON.stringify({ error: "unknown_tool" });
        }
      } catch (e) {
        output = JSON.stringify({ error: String(e) });
      }
      messages.push({ role: "tool", tool_call_id: call.id, content: output });
    }
  }
  throw new Error("Tool loop exceeded max turns");
}

Two things I want to call out. First, the for (turn...) cap. Without a hard turn limit, a confused model can loop forever calling the same tool — I've watched one spend $40 in tokens in ninety seconds before a circuit breaker tripped. Six turns is plenty for most workflows; instrument and tune. Second, I return tool errors back to the model as JSON rather than throwing. The model can often recover — re-ask for a valid order ID, or pick a different tool — instead of crashing the loop.
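The turn cap bounds iterations but not spend. A token budget can sit alongside it, as in this sketch. `TokenBudget` is a hypothetical helper and the limit is an arbitrary number; the usage figure would come from the `usage.total_tokens` field on each chat completion response.

```typescript
// Tracks cumulative token use across a tool loop and trips once a limit is crossed.
class TokenBudget {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Record usage from one response; throws when the running total exceeds the limit.
  record(totalTokens: number): void {
    this.used += totalTokens;
    if (this.used > this.limit) {
      throw new Error(`Token budget exceeded: ${this.used} > ${this.limit}`);
    }
  }

  // Tokens left before the breaker trips, never negative.
  get remaining(): number {
    return Math.max(0, this.limit - this.used);
  }
}
```

Inside the loop this would be one call after each response, for example `budget.record(res.usage?.total_tokens ?? 0)`, so a runaway conversation fails fast even if it stays under the turn cap.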

Anthropic's API is structurally identical: you pass a tools array where each tool has a name, description, and input_schema; the model responds with a tool_use content block; you reply with a tool_result block referencing the same tool_use_id. If you've internalized the OpenAI loop, the Anthropic one is a mechanical translation. Both providers' docs are worth reading end to end because the edge cases (forced tool choice, streaming partial tool calls) differ in the details.
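As a sketch of that translation, the function below turns the `tool_use` blocks from an assistant turn into the `tool_result` blocks for the follow-up message. The local types are trimmed approximations of the API shapes and `run` is a hypothetical executor you would supply; no SDK is involved.

```typescript
type ToolUse = { type: "tool_use"; id: string; name: string; input: unknown };
type AssistantBlock = ToolUse | { type: "text"; text: string };
type ToolResult = {
  type: "tool_result";
  tool_use_id: string;
  content: string;
  is_error?: boolean;
};

// run: hypothetical executor that maps a tool name and input to a result.
type Run = (name: string, input: unknown) => Promise<unknown>;

async function answerToolUses(blocks: AssistantBlock[], run: Run): Promise<ToolResult[]> {
  // Text blocks need no reply; only tool_use blocks do.
  const toolUses = blocks.filter((b): b is ToolUse => b.type === "tool_use");
  return Promise.all(
    toolUses.map(async (use): Promise<ToolResult> => {
      try {
        const output = await run(use.name, use.input);
        return { type: "tool_result", tool_use_id: use.id, content: JSON.stringify(output) };
      } catch (e) {
        // Report the failure to the model rather than aborting the loop.
        return { type: "tool_result", tool_use_id: use.id, content: String(e), is_error: true };
      }
    }),
  );
}
```

The returned array becomes the `content` of the next message with `role: "user"`.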

Parallel tool calls

Modern models can return multiple tool calls in a single turn when the calls are independent. If a user asks "compare the weather in Lisbon and Belgrade," the model may emit two get_weather calls at once. Run them concurrently:

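const results = await Promise.all(
  (msg.tool_calls ?? []).map(async (call) => {
    let content: string;
    try {
      const args = JSON.parse(call.function.arguments);
      content = JSON.stringify(await dispatch(call.function.name, args));
    } catch (e) {
      // A rejected promise here would drop every reply, so report the error instead.
      content = JSON.stringify({ error: String(e) });
    }
    return { role: "tool" as const, tool_call_id: call.id, content };
  }),
);
messages.push(...results);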

Two practical notes. You must push a tool message for every tool_call_id the model returned, or the next request 400s with a mismatched-tool-call error — partial responses are not allowed. And if you don't want parallelism (say, your tools have side effects that must be ordered), set parallel_tool_calls: false rather than trying to serialize after the fact.
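The `dispatch` function in the parallel snippet is left undefined. One way to fill it in is a registry keyed by tool name, so unknown names are rejected instead of executed. Everything here (`registry`, `dispatch`, `missingToolReplies`, the stubbed weather tool) is hypothetical; the second helper checks the every-call-gets-a-reply rule before you send the next request.

```typescript
type ToolHandler = (args: unknown) => Promise<unknown>;

// Allowlist of executable tools. Anything not listed here is never run.
const registry = new Map<string, ToolHandler>([
  // Stub handler with made-up data, standing in for a real weather lookup.
  ["get_weather", async (args) => ({ city: (args as { city: string }).city, tempC: 21 })],
]);

// Route a tool call to its handler, or return a structured error for unknown names.
async function dispatch(name: string, args: unknown): Promise<unknown> {
  const handler = registry.get(name);
  if (!handler) return { error: "unknown_tool", name };
  return handler(args);
}

// Returns the IDs of tool calls that have no matching tool message yet.
function missingToolReplies(
  calls: { id: string }[],
  replies: { tool_call_id: string }[],
): string[] {
  const answered = new Set(replies.map((r) => r.tool_call_id));
  return calls.map((c) => c.id).filter((id) => !answered.has(id));
}
```

An empty array from `missingToolReplies` means the message list is safe to send.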

Determinism and idempotency

temperature: 0 reduces variance but does not give you determinism. Floating-point non-associativity across GPU batches means identical inputs can still diverge occasionally; OpenAI's seed parameter plus the system_fingerprint gets you closer but is explicitly best-effort, not a guarantee. Stop trying to make the model deterministic. Make your tools idempotent instead.

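If a tool charges a card or sends an email, the model calling it twice (across a retry, a duplicated parallel call, or a loop bug) must not double-charge. Pass an idempotency key. The simplest version derives it from the tool-call ID, which protects against your own code executing the same call twice. A model that re-issues the request gets a fresh call ID, so for that case derive the key from the operation itself (an invoice or order ID, say). The tool-call version looks like this: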

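import Stripe from "stripe";

const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!); // env var name is an assumption

async function chargeCard(args: { amount: number; customerId: string }, callId: string) {
  return stripe.charges.create(
    { amount: args.amount, currency: "eur", customer: args.customerId },
    { idempotencyKey: `tool_${callId}` }, // same call ID, same key, one charge
  );
}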

Treat every tool that mutates state as if the model will call it twice, because eventually it will. This is the single highest-leverage guardrail in the whole stack.
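The same idea can be shown without a payment provider. This sketch keeps an in-memory map of results by idempotency key, so a second call with the same key returns the first result instead of repeating the side effect. `runOnce` and the `Map` store are illustrative assumptions; a real deployment would need a durable store shared across processes.

```typescript
// Results keyed by idempotency key. In-memory only, for illustration.
const completed = new Map<string, Promise<unknown>>();

// Run `effect` at most once per key; later calls with the same key reuse the first result.
// Note: a failed effect stays cached here. A production version would decide
// whether failures are retryable before storing them.
function runOnce<T>(key: string, effect: () => Promise<T>): Promise<T> {
  const existing = completed.get(key);
  if (existing) return existing as Promise<T>;
  const pending = effect();
  completed.set(key, pending); // store the promise so concurrent duplicates share it too
  return pending;
}
```

Storing the promise rather than the resolved value matters for parallel tool calls: two duplicates that arrive before the first one finishes still collapse into one execution.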

Guardrails checklist

Before you ship anything that lets a model call tools against real systems:

Schema-enforce, don't prompt-beg. Use strict: true Structured Outputs or schema-bound tools, never "please return JSON."
Validate again in code. Schema enforcement covers structure; your Zod .refine() rules cover business logic.
Cap the loop. Hard turn limit plus a token/cost circuit breaker.
Make mutating tools idempotent. Idempotency key per tool-call ID.
Never give the model raw execution. It emits intent; your code decides whether to honor it. Restrict database credentials and API scopes to least privilege; this is straight out of the OWASP Top 10 for LLM Applications (LLM06:2025, Excessive Agency).
Return errors to the model, crash on nothing. Let it recover when it can.
Pin model versions. Float in dev, pin in prod.

The mental model that matters: the LLM is a fancy, occasionally-wrong source of structured values and intentions. It is never the thing that touches your database. Once you internalize that boundary, structured outputs and tool calling stop being magic and become what they actually are — a typed RPC interface with a probabilistic client on the other end. You already know how to build reliable systems on unreliable clients. This is that, again.