LLM Agents That Actually Work: Tools, Loops, and Guardrails

A team I reviewed last quarter had built an "autonomous research agent." Forty-three files, a planner, a critic, a memory subsystem, a vector store. It answered three questions per minute when it worked and cost $0.60 per answer when it didn't loop forever. I replaced it with a 90-line function: one model call, one tool, a JSON schema on the output. Same task, 8x cheaper, p99 latency down from 40 seconds to 4. Nobody missed the architecture.

That is the most important thing I can tell you about agents: most of the time you don't need one. The word "agent" has been stretched to cover everything from a for loop to a fully autonomous system, and the stretching is where the wasted money and the incidents come from. So let me draw a hard line first, then show you how to build the real thing when the task actually demands it.

What an agent actually is

An LLM agent is a loop. The model looks at the current state, decides what to do, calls a tool, observes the result, and repeats — until it decides it's done or you stop it. That's it. The defining property is that the model controls the control flow. It chooses the next step at runtime based on what it just saw. You don't know the sequence of tool calls in advance, because the model is deciding them.

Contrast that with a workflow, where you control the flow. You write the steps: extract the invoice fields, validate them against the schema, look up the vendor, post to the ledger. The model does one bounded job at each step — extraction, classification, a rewrite — and your code routes between them. The model never decides what happens next.

This distinction is the whole game. Anthropic's and OpenAI's agent-building guidance both land in the same place: reach for a fixed workflow first, and escalate to an open-ended loop only when the task is genuinely hard to specify ahead of time. The reason is operational, not aesthetic. A workflow is debuggable — you log every branch, write a test per step, and reason about cost because your code bounds the number of model calls. An agent loop is a probability distribution over execution traces. You are signing up to debug something nondeterministic.
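To make the workflow side of that line concrete, here is a minimal sketch of the invoice example as a fixed pipeline. Everything in it is hypothetical: extractFields stands in for a single model call and is stubbed with string parsing so the sketch runs offline, and KNOWN_VENDORS is a made-up lookup table. The point is only the shape: the code decides what happens next, every branch lands in a log, and the number of model calls is bounded by construction.

```typescript
// A fixed workflow: the code owns the control flow, the model does one bounded job.
type Invoice = { vendor: string; totalCents: number };
type StepLog = { step: string; ok: boolean };

// Hypothetical stand-in for ONE model call that extracts fields from raw text.
// Stubbed as "vendor | total" parsing so the example runs without an API key.
function extractFields(raw: string): Invoice {
  const [vendor = "", total = ""] = raw.split("|");
  return { vendor: vendor.trim(), totalCents: Math.round(Number(total) * 100) };
}

// Plain code, no model: reject empty vendors and non-positive or non-numeric totals.
function validate(inv: Invoice): boolean {
  return inv.vendor.length > 0 && Number.isInteger(inv.totalCents) && inv.totalCents > 0;
}

// Hypothetical vendor table; a real system would query a database here.
const KNOWN_VENDORS = new Set(["Acme Corp", "Globex"]);

function processInvoice(raw: string): { posted: boolean; log: StepLog[] } {
  const log: StepLog[] = [];

  const inv = extractFields(raw);
  log.push({ step: "extract", ok: true });

  const valid = validate(inv);
  log.push({ step: "validate", ok: valid });
  if (!valid) return { posted: false, log };

  const known = KNOWN_VENDORS.has(inv.vendor);
  log.push({ step: "vendor_lookup", ok: known });
  if (!known) return { posted: false, log };

  // Posting to the ledger is omitted; the step is only recorded.
  log.push({ step: "post_to_ledger", ok: true });
  return { posted: true, log };
}
```

Each step is an ordinary function, so each one gets its own unit test, and a failed run tells you exactly which branch it died on.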

So before you build a loop, run this gate. Build an agent only when all of these are true:

The task is multi-step and you genuinely cannot enumerate the steps in advance.
The value of the outcome justifies higher latency, higher cost, and a fatter failure surface.
Errors are catchable and recoverable — tests, review, rollback, a human in the path.

If any answer is "no," write a workflow or a single call. "Extract the title from this PDF" is a function call. "Triage this stack trace across an unknown set of repos and propose a fix" is plausibly an agent. Most tickets are the former wearing the latter's clothes.

The control loop, minimally

When you do need the loop, the loop itself is small. Here is a complete, bounded agent in TypeScript against the Anthropic SDK. No framework. The guardrails are the point — read them, not the happy path.

import Anthropic from "@anthropic-ai/sdk";
 
const client = new Anthropic();
 
const MAX_STEPS = 12;          // hard cap on iterations
const MAX_USD = 0.50;          // per-run budget ceiling
 
// Opus 4.8 pricing: $5 / 1M input, $25 / 1M output
const COST = { in: 5 / 1e6, out: 25 / 1e6 };
 
type ToolImpl = (input: any) => Promise<string>;
 
async function runAgent(
  goal: string,
  tools: Anthropic.Tool[],
  impls: Record<string, ToolImpl>,
) {
  const messages: Anthropic.MessageParam[] = [
    { role: "user", content: goal },
  ];
  let spentUsd = 0;
 
  for (let step = 0; step < MAX_STEPS; step++) {
    const res = await client.messages.create({
      model: "claude-opus-4-8",
      max_tokens: 4096,
      system:
        "You are a task-completing agent. Use tools when they help. " +
        "When the task is complete, reply with a final answer and no tool call.",
      tools,
      messages,
    });
 
    spentUsd +=
      res.usage.input_tokens * COST.in + res.usage.output_tokens * COST.out;
    if (spentUsd > MAX_USD) {
      throw new Error(`run aborted: cost cap $${MAX_USD} exceeded at step ${step}`);
    }
 
    messages.push({ role: "assistant", content: res.content });
 
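    if (res.stop_reason === "max_tokens") {
      // Truncated output is not a final answer; fail loudly instead of returning it.
      throw new Error(`run aborted: response truncated at max_tokens on step ${step}`);
    }
    if (res.stop_reason !== "tool_use") {
      const text = res.content.find((b) => b.type === "text");
      return { answer: text?.type === "text" ? text.text : "", steps: step + 1, spentUsd };
    }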
 
    // Execute every tool_use block, return ALL results in one user turn.
    const toolResults: Anthropic.ToolResultBlockParam[] = [];
    for (const block of res.content) {
      if (block.type !== "tool_use") continue;
      const impl = impls[block.name];
      try {
        if (!impl) throw new Error(`unknown tool ${block.name}`);
        const out = await impl(block.input);
        toolResults.push({ type: "tool_result", tool_use_id: block.id, content: out });
      } catch (err) {
        // Feed the error back as a result — don't crash the loop.
        toolResults.push({
          type: "tool_result",
          tool_use_id: block.id,
          content: `error: ${(err as Error).message}`,
          is_error: true,
        });
      }
    }
    messages.push({ role: "user", content: toolResults });
  }
 
  throw new Error(`run aborted: hit MAX_STEPS=${MAX_STEPS} without finishing`);
}

Three things in there are non-negotiable and routinely missing in the wild:

The step cap. Without MAX_STEPS, a confused model that keeps re-reading the same file or re-running the same failing command will loop until something else kills it — usually your timeout or your bill. Twelve is a starting point; tune it per task and alert when runs hit the cap, because hitting the cap is a signal the task was underspecified or the tools are wrong.

The cost cap. I track spend per run by accumulating usage off each response and abort when it crosses a ceiling. This is the single cheapest insurance you can buy. One malformed prompt that makes the model thrash can otherwise turn into a four-figure surprise. (If you're on a model that supports them, server-side task budgets push a token countdown into the model so it self-moderates — but that's a nice-to-have on top of your own hard cap, not a replacement for it.)
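As a sketch of the accounting, assuming the same $5 and $25 per million token rates used in the loop above, the tracker below keeps spend in integer micro-dollars so repeated additions don't accumulate floating-point drift. RunBudget and its method names are mine, not part of any SDK.

```typescript
// Prices in micro-dollars per token, derived from the rates quoted in the loop:
// $5 per 1M input tokens = 5 micro-USD each, $25 per 1M output tokens = 25 micro-USD each.
const MICRO_USD_PER_INPUT_TOKEN = 5;
const MICRO_USD_PER_OUTPUT_TOKEN = 25;

type Usage = { input_tokens: number; output_tokens: number };

// Hypothetical per-run budget tracker.
class RunBudget {
  private spentMicroUsd = 0;
  private capMicroUsd: number;

  constructor(capMicroUsd: number) {
    this.capMicroUsd = capMicroUsd;
  }

  // Add one response's usage; throw once cumulative spend passes the cap.
  charge(u: Usage): void {
    this.spentMicroUsd +=
      u.input_tokens * MICRO_USD_PER_INPUT_TOKEN +
      u.output_tokens * MICRO_USD_PER_OUTPUT_TOKEN;
    if (this.spentMicroUsd > this.capMicroUsd) {
      throw new Error(`cost cap exceeded: spent $${this.spentUsd.toFixed(4)}`);
    }
  }

  get spentUsd(): number {
    return this.spentMicroUsd / 1e6;
  }
}
```

The arithmetic is worth doing once by hand. A step that reads 8,000 input tokens and writes 500 output tokens costs 8,000 × 5 + 500 × 25 = 52,500 micro-dollars, or $0.0525. Nine such steps come to $0.4725, and the tenth pushes the run to $0.525, past a $0.50 cap. In a real loop the input side grows every step as the transcript accumulates, so the cap usually trips sooner than a flat estimate suggests.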

Errors go back into the loop as results, not exceptions. When a tool throws, I return a tool_result with is_error: true instead of unwinding the stack. The model sees "error: file not found" and adapts — tries a different path, asks for clarification, gives up gracefully. Crash the loop on every tool failure and your agent is brittle in exactly the situations agents are supposed to handle.

Tool design is where agents succeed or fail

The model is only as good as the tools you hand it, and tool design is mostly an exercise in restraint. Three rules I hold to.

Make tools small and single-purpose. A database_query tool that takes arbitrary SQL is a tool you cannot reason about. A get_order_by_id tool is one you can gate, cache, log, and rate-limit. The narrower the tool, the more your harness can do around it.

Write the description for the model, and be prescriptive about when. The newer Opus models reach for tools more conservatively than their predecessors, so a description that only states what a tool does underperforms one that states when to call it. "Look up current inventory. Call this whenever the user asks about stock, availability, or 'do you have'." That trigger clause measurably raises the should-call rate.

Keep the input schema strict. Use JSON Schema with additionalProperties: false, real enums, and a tight required list. Here's a tool definition that is gateable and hard to misuse:

{
  "name": "issue_refund",
  "description": "Refund a charge to the original payment method. Call this only after the user has explicitly confirmed the amount and the order ID. Hard-to-reverse — the harness will require human approval.",
  "input_schema": {
    "type": "object",
    "additionalProperties": false,
    "properties": {
      "order_id": { "type": "string", "pattern": "^ord_[a-zA-Z0-9]{16}$" },
      "amount_cents": { "type": "integer", "minimum": 1, "maximum": 50000 },
      "reason": { "type": "string", "enum": ["defective", "not_received", "duplicate", "other"] }
    },
    "required": ["order_id", "amount_cents", "reason"]
  }
}

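The maximum on amount_cents is a guardrail, not a UX nicety: even if the model is jailbroken into trying to refund $9,000, the schema caps it at $500. One caveat: the schema only binds if something enforces it. Unless your provider guarantees schema-conforming tool inputs, the model can still emit a call that violates the schema, so validate every tool input against it in the harness and reject failures before the tool runs. Defense in depth means the dangerous capability is bounded at the type level, before your business logic ever runs.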
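One way to enforce that bound in the harness is to check the tool input against the same constraints before anything executes. The sketch below hand-rolls the checks for issue_refund so it has no dependencies; validateRefundInput is a hypothetical helper, and in a real codebase a JSON Schema validator library would likely replace it.

```typescript
// Mirrors the issue_refund input_schema: pattern, integer bounds, enum, no extra keys.
const ORDER_ID = /^ord_[a-zA-Z0-9]{16}$/;
const REASONS: readonly string[] = ["defective", "not_received", "duplicate", "other"];
const ALLOWED_KEYS = new Set(["order_id", "amount_cents", "reason"]);

// Returns a list of violations; an empty list means the input matches the schema.
function validateRefundInput(input: unknown): string[] {
  if (typeof input !== "object" || input === null || Array.isArray(input)) {
    return ["input must be an object"];
  }
  const obj = input as Record<string, unknown>;
  const errors: string[] = [];

  // additionalProperties: false
  for (const key of Object.keys(obj)) {
    if (!ALLOWED_KEYS.has(key)) errors.push(`unexpected property: ${key}`);
  }

  const orderId = obj["order_id"];
  if (typeof orderId !== "string" || !ORDER_ID.test(orderId)) {
    errors.push("order_id must match ^ord_[a-zA-Z0-9]{16}$");
  }

  const amount = obj["amount_cents"];
  if (typeof amount !== "number" || !Number.isInteger(amount) || amount < 1 || amount > 50000) {
    errors.push("amount_cents must be an integer from 1 to 50000");
  }

  const reason = obj["reason"];
  if (typeof reason !== "string" || REASONS.indexOf(reason) === -1) {
    errors.push("reason must be one of: " + REASONS.join(", "));
  }
  return errors;
}
```

A non-empty result would go back to the model as a tool_result with is_error: true, the same way a thrown tool error does, so the model can correct the call instead of the run crashing.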

This is also where the bash-versus-dedicated-tool decision lives. A bash tool gives the model enormous reach — it can do almost anything — but it hands your harness an opaque command string. A dedicated issue_refund tool gives the harness a typed, named action it can intercept, render in a confirmation modal, and audit. Rule of thumb: start with bash for breadth during prototyping, then promote any action that needs gating, custom UI, or an audit trail into its own tool. Reversibility is the test — hard-to-reverse actions (sending email, deleting data, moving money) earn a dedicated tool every time.
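A dedicated tool is what makes the interception possible. As a rough sketch, assuming the harness keeps a set of tool names it treats as hard to reverse and a set of call IDs a human has already approved (both hypothetical), the gate can be a pure function that runs before dispatch:

```typescript
type Decision = "execute" | "needs_approval";
type PendingCall = { id: string; name: string; input: unknown };

// Hypothetical list of tools this harness treats as hard to reverse.
const IRREVERSIBLE = new Set(["issue_refund", "send_email", "delete_record"]);

// Reversible tools run immediately; irreversible ones run only once their
// specific call ID has been approved by a human.
function gate(call: PendingCall, approvedIds: ReadonlySet<string>): Decision {
  if (!IRREVERSIBLE.has(call.name)) return "execute";
  return approvedIds.has(call.id) ? "execute" : "needs_approval";
}

// Text for the confirmation UI and the audit log. Because the call is typed and
// named, the reviewer sees the exact action and arguments, not a shell string.
function approvalPrompt(call: PendingCall): string {
  return `Approve ${call.name} (${call.id}) with input ${JSON.stringify(call.input)}?`;
}
```

Approval is keyed on the call ID, not the tool name, so approving one refund does not silently approve the next one.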

Guardrails: assume the model will misbehave

Treat the agent loop as untrusted code execution, because that's what it is. The OWASP Top 10 for LLM Applications names the failure modes you actually hit in production. In the 2025 edition, prompt injection (LLM01), excessive agency (LLM06), and unbounded consumption (LLM10) are the agent-specific ones. Design against them:

Risk	What goes wrong	Guardrail
Runaway cost	Model loops, re-calling tools forever	Per-run step cap and USD cap, alert on hitting either
Unsafe actions	Model deletes data, sends email, moves money	Allowlist of tools; human approval on irreversible ones
Prompt injection	Tool output contains "ignore previous instructions…"	Treat all tool output as untrusted data, never as instructions; never put secrets in the prompt
Context bloat	Transcript grows past the window, cost climbs per step	Cap steps; clear or compact stale tool results
Sandbox escape	Bash/code tool touches the host	Run tools in an isolated container, deny-by-default egress

The two I see skipped most often: allowlisting and sandboxing. Allowlisting means the agent can only invoke the specific tools you registered for this run — not "every tool in the system." Sandboxing means tool execution happens somewhere the blast radius is contained. If you're letting a model run shell commands, run them in a container with no host filesystem access and deny-by-default network egress:

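# Run the agent's tool executor in a throwaway container.
# No host mounts, capabilities dropped, network off unless explicitly allowed.
# /work is mounted as an empty scratch tmpfs, which hides anything the image had at
# that path, so the entrypoint script has to live elsewhere in the image.
# /usr/local/bin/run-tool.sh is an assumed location; use wherever your image puts it.
docker run --rm \
  --network none \
  --read-only \
  --cap-drop ALL \
  --memory 512m --cpus 1 \
  --tmpfs /work:rw,size=256m \
  agent-sandbox:latest /usr/local/bin/run-tool.sh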

That --network none plus --cap-drop ALL is the difference between a misbehaving agent that wastes some tokens and one that exfiltrates your environment variables. Prompt injection is real: the moment a tool fetches a web page or reads a user-supplied file, that content can carry instructions aimed at your model. You can't fully prevent the model from reading them — your defense is that the model can't do anything dangerous even if it's convinced to try, because the tools are allowlisted and the sandbox is locked down.
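Allowlisting is small enough to sketch. Assuming a registry of every tool the system knows about (the names and stub implementations here are hypothetical, and real implementations would be async), each run gets only the subset it was explicitly granted, and a call to anything else comes back as an error result instead of executing:

```typescript
type ToolImpl = (input: unknown) => string;
type ToolResult = { content: string; is_error: boolean };

// Everything the system could do. Stubs only; a real registry would hold real tools.
const REGISTRY = new Map<string, ToolImpl>([
  ["get_order_by_id", () => "order found"],
  ["issue_refund", () => "refund issued"],
  ["run_shell", () => "command output"],
]);

// Build the tool set for one run. A typo in the allowlist fails at startup,
// not halfway through the run.
function toolsForRun(allowlist: readonly string[]): Map<string, ToolImpl> {
  const granted = new Map<string, ToolImpl>();
  for (const name of allowlist) {
    const impl = REGISTRY.get(name);
    if (!impl) throw new Error(`allowlist names unregistered tool: ${name}`);
    granted.set(name, impl);
  }
  return granted;
}

// Dispatch looks only at the per-run map, never at the global registry.
function dispatch(granted: Map<string, ToolImpl>, name: string, input: unknown): ToolResult {
  const impl = granted.get(name);
  if (!impl) {
    return { content: `error: tool ${name} is not allowed in this run`, is_error: true };
  }
  try {
    return { content: impl(input), is_error: false };
  } catch (err) {
    return { content: `error: ${(err as Error).message}`, is_error: true };
  }
}
```

Only the granted tools should be advertised to the model in the first place; the dispatch check is the backstop for when injected content talks the model into naming a tool it was never shown.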

Planning, memory, and the MCP question

Two more design axes, briefly, because they're where people over-build.

Planning vs. reactive. A reactive agent — the loop above — decides one step at a time. A planning agent first writes a plan, then executes it. Planning helps on long-horizon tasks where the model would otherwise lose the thread, and the newer models are good enough that an explicit "make a plan, then execute it" instruction in the system prompt often beats a separate planner module. Don't build a planning subsystem before you've tried a sentence.

Memory and state. Within a run, your messages array is the state — keep it lean by clearing tool results you no longer need. Across runs, if the agent must remember things, persist to a store and load the relevant slice at the start of the next run. Don't reach for a vector database on day one; a plain file or a row in Postgres covers most "remember the user's preferences" cases, and it's far easier to inspect when something goes wrong.
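For the within-run half, a minimal sketch of clearing stale tool results might look like this. The message types are simplified local stand-ins for the SDK's, and the placeholder text and keepLast parameter are my own choices. Old results are replaced instead of deleted so every tool_use block keeps its matching tool_result.

```typescript
// Simplified stand-ins for the SDK's message and content block types.
type TextBlock = { type: "text"; text: string };
type ToolUseBlock = { type: "tool_use"; id: string; name: string; input: unknown };
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string };
type Block = TextBlock | ToolUseBlock | ToolResultBlock;
type Message = { role: "user" | "assistant"; content: string | Block[] };

const PLACEHOLDER = "[tool result cleared to save context]";

// Keep the newest `keepLast` tool results verbatim and blank out older ones.
// Returns a new array; the input messages are not mutated.
function clearStaleToolResults(messages: Message[], keepLast: number): Message[] {
  // First pass: count tool results so we know which ones are "old".
  let total = 0;
  for (const m of messages) {
    if (typeof m.content === "string") continue;
    for (const b of m.content) {
      if (b.type === "tool_result") total++;
    }
  }

  // Second pass: replace the content of all but the last `keepLast` results.
  let seen = 0;
  return messages.map((m) => {
    if (typeof m.content === "string") return m;
    const content = m.content.map((b): Block => {
      if (b.type !== "tool_result") return b;
      seen++;
      const stale = seen <= total - keepLast;
      return stale ? { ...b, content: PLACEHOLDER } : b;
    });
    return { ...m, content };
  });
}
```

Calling this before each model request keeps per-step input cost roughly flat instead of growing with every file the agent has ever read.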

MCP. The Model Context Protocol is the emerging standard for exposing tools to LLM hosts. Instead of hand-wiring each integration into every app, you run an MCP server that advertises its tools, and any MCP-aware host can discover and call them. The payoff is reuse: write a GitHub or Postgres tool server once, and it works across Claude Code, your own agent, and any other MCP client. If you're building more than one agent, or you want your tools usable from someone else's host, MCP is worth adopting. The guardrails in this post don't change: an MCP tool is still a tool, and it still needs an allowlist, a sandbox, and a description that says when to call it.

The checklist

Before you ship anything you're calling an agent, run this:

Did you need an agent at all? If you can name the steps, it's a workflow. Build that instead.
Is there a hard step cap and a hard cost cap, with alerts on both?
Is the tool set allowlisted per run — not "everything the system can do"?
Do irreversible tools require human approval, and are they bounded at the schema level?
Does tool execution run in a sandbox with no host access and deny-by-default egress?
Is all tool output treated as untrusted data, never as instructions, with no secrets in the prompt?
Do tool errors flow back into the loop as results instead of crashing it?
Can you replay and debug a single run from its logged transcript?
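For the last item, a replayable transcript can be as simple as one JSON object per line. This is a sketch with made-up event shapes, held in memory; where the lines get written (a file, a log pipeline) is left to the harness.

```typescript
// Hypothetical event shapes; one is recorded per model response, tool call, and tool result.
type RunEvent =
  | { kind: "model_response"; step: number; stopReason: string; inputTokens: number; outputTokens: number }
  | { kind: "tool_call"; step: number; name: string; input: unknown }
  | { kind: "tool_result"; step: number; name: string; isError: boolean; content: string };

class Transcript {
  private lines: string[] = [];

  record(e: RunEvent): void {
    this.lines.push(JSON.stringify(e));
  }

  // JSON Lines: one event per line, append-friendly and greppable.
  toJsonl(): string {
    return this.lines.join("\n");
  }
}

function parseTranscript(jsonl: string): RunEvent[] {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => JSON.parse(line) as RunEvent);
}

// Summarize a run as "step:tool(ok|error)" so a loop or a repeated failure
// is visible at a glance.
function toolTrace(events: RunEvent[]): string[] {
  const out: string[] = [];
  for (const e of events) {
    if (e.kind === "tool_result") {
      out.push(`${e.step}:${e.name}(${e.isError ? "error" : "ok"})`);
    }
  }
  return out;
}
```

A trace that reads read_file(error) twelve times in a row tells you the run hit the step cap because of a bad path, not because the task was too hard.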

Agents are the most powerful and the most dangerous pattern in this space. The power is real when the task genuinely needs runtime decision-making. The danger is that a nondeterministic loop with tools is a system that can spend your money and take real actions while you sleep. Reach for the simplest thing that works, bound everything, and earn the loop. Most of the time, the function call was the right answer all along.