Common Pitfalls

The catalogue of agent failure modes we see most often. Read this before you go live — and again whenever you ship a new agent.

The Failure Modes We See Most Often

These pitfalls come up repeatedly across customer engagements. Most of them are subtle in isolation but compound badly in production.

The pitfalls are roughly ordered from most-to-least common. The first three account for the majority of bad agent incidents.

Over-Permissive Tool Sets

What it looks like: An agent has been assigned every tool from a connector "just in case." It only ever needs three of them, but the others are sitting there for the model to consider on every call.

Why it hurts: Each extra tool is a way the agent can surprise you. Tools also consume tokens (their schemas are part of the prompt) — every additional tool slightly increases per-conversation cost. Most damagingly, an over-permissive agent is hard to debug: when it does something weird, you cannot tell whether the issue is instruction-driven or tool-driven, because the surface area is too large.

How to avoid: When configuring an agent, ask for each tool: "Can I name a specific scenario in this agent's workflow where this tool will be called?" If you cannot, remove it. See the Tool Selection page for the recommended sets per common role.
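
To make the token cost concrete: every assigned tool's schema travels with the prompt on every call, whether or not the tool is ever used. The sketch below uses an invented tool schema and the open-source tiktoken tokenizer as a rough stand-in; it is an estimate of the overhead, not Macha's actual accounting.

```python
import json
import tiktoken  # pip install tiktoken; a rough stand-in tokenizer, not Macha's accounting

# An invented tool schema, similar in shape to what gets embedded in the prompt.
unused_tool = {
    "name": "zendesk_merge_tickets",
    "description": "Merge one or more source tickets into a target ticket.",
    "parameters": {
        "type": "object",
        "properties": {
            "target_ticket_id": {"type": "integer"},
            "source_ticket_ids": {"type": "array", "items": {"type": "integer"}},
            "comment": {"type": "string", "description": "Note added to the target ticket."},
        },
        "required": ["target_ticket_id", "source_ticket_ids"],
    },
}

enc = tiktoken.get_encoding("cl100k_base")
tokens_per_call = len(enc.encode(json.dumps(unused_tool)))
print(f"~{tokens_per_call} extra tokens on every conversation turn, used or not")
```

Multiply that by every tool the agent never calls and by every conversation it handles, and the quiet cost of "just in case" tools becomes visible.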

Vague Instructions

What it looks like: "You are a helpful support agent. Answer customer questions accurately and politely." Nothing about the workflow, the tools, the boundaries, or the tone.

Why it hurts: Vague instructions force the model to improvise. Improvisation is where bad outputs come from. An agent told only to "be helpful" will generate plausible-sounding replies that may or may not be correct, may or may not match your brand, may or may not stay within scope.

How to avoid: See the Writing Instructions page. The short version: identity, numbered workflow, exact tool names, explicit boundaries, escape hatches. Default to writing more, not less.

Skipping The Chat Phase

What it looks like: Agent created on Monday, trigger enabled on Monday afternoon, first bad customer reply by Tuesday morning.

Why it hurts: Chat mode is where you catch instruction-level bugs at the lowest possible cost. Skipping it means those bugs surface in production where they are visible to customers.

How to avoid: See the Testing Best Practices page. Run at least ten chat conversations before any trigger. Use Test Run on at least ten varied inputs after that. Run internal-notes-only mode for a week before exposing customer-facing writes. None of these stages is optional for production-bound agents.

Wrong Model For The Workload

What it looks like: A high-volume tag-and-categorise agent running on GPT-5.4 (5 credits per response), or a customer-facing reply agent running on GPT-5.4 Mini (1 credit) and giving low-quality replies.

Why it hurts: The first wastes credits. The second wastes customer trust. Both come from picking a model based on default or habit rather than the workload.

How to avoid: See the Choosing a Model page. Use the three-question framework. A/B test before committing.

Trigger Conditions Too Broad

What it looks like: Trigger fires on every new ticket. Agent runs on tickets it was not designed for. Credit burn is high. Customer-facing writes happen on tickets that should have stayed silent.

Why it hurts: Almost every credit-burn incident traces back to this. Almost every "the agent replied to a ticket it shouldn't have" incident also traces back to this.

How to avoid: Add filters. Tags, brands, groups, custom field values. Start narrow. Expand only after a week of clean data.
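
If it helps to picture what "start narrow" means, the sketch below contrasts an unfiltered trigger with a scoped one. The structure and field names are invented for illustration; in Macha the filters live in the trigger configuration itself, not in code.

```python
# Invented representation of trigger filters, for illustration only.

broad_trigger = {
    "event": "ticket.created",
    "filters": {},  # fires on every new ticket: high credit burn, writes on the wrong tickets
}

narrow_trigger = {
    "event": "ticket.created",
    "filters": {
        "brand": "acme-store",                      # only the brand this agent was built for
        "group": "tier-1-support",                  # only the queue it should touch
        "tags_any": ["billing", "refund-request"],  # only the categories it was designed for
        "custom_fields": {"region": "emea"},        # any field that scopes the workflow
    },
}
```

Each filter removes a class of tickets the agent was never designed to handle; loosen them one at a time, after the week of clean data, rather than all at once.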

Letting The LLM Improvise On Missing Data

What it looks like: The workflow expects a custom field (say, customer_tier) to be set. The field is empty. The agent improvises a value or proceeds as if the customer is "standard tier" without checking.

Why it hurts: The agent confidently does the wrong thing. The team sees a reply that assumed an enterprise customer was standard tier, or vice versa, and trust drops.

How to avoid: Write explicit fallbacks for missing data. "If customer_tier is empty, add an internal note tagging the team and stop." Do not assume the agent will gracefully handle missing inputs unless you tell it how.
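
Here is the same fallback expressed as logic rather than prose, which can make the intended behaviour easier to review. Every helper below is a hypothetical stand-in for one of the agent's tools, not a real Macha or Zendesk function.

```python
# A minimal sketch of the fallback the instruction above describes.
# get_custom_field and add_internal_note are hypothetical stand-ins for the agent's tools.

def get_custom_field(ticket_id: int, field: str) -> str | None:
    ...  # stand-in for a ticket field lookup

def add_internal_note(ticket_id: int, body: str) -> None:
    ...  # stand-in for an internal-note write

def handle_ticket(ticket_id: int) -> None:
    tier = get_custom_field(ticket_id, "customer_tier")
    if not tier:
        # Missing input: hand off to a human instead of guessing a value.
        add_internal_note(ticket_id, "customer_tier is empty, needs manual triage")
        return  # stop here; do not proceed as if the customer were standard tier
    # The normal tier-based workflow continues only when the field is actually set.
```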

No Human Review During Early Life

What it looks like: Agent goes from "just created" to "autonomously replying to customers" without ever running in internal-notes-only mode. Bad replies ship before anyone sees them.

Why it hurts: The first 100 production conversations are when an agent's actual quality reveals itself. If those 100 are public-facing, your customers see every mistake. If they are internal notes, only your team does.

How to avoid: Internal-notes-only mode. One full week, including a weekend. See Testing Best Practices.

Forgetting To Monitor

What it looks like: Agent shipped six months ago, worked great at launch, no one has looked at its conversations since. Quality has slowly drifted but no one noticed because no one was looking.

Why it hurts: Drift is silent. The agent does not "fail" — it just gradually does worse. By the time someone notices, the team has been compensating manually for weeks.

How to avoid: Calendar reminder. Weekly for the first month after launch, then monthly. Spot-check five recent conversations. See the Iterating page.

Custom Tools With Deeply Nested Parameters

What it looks like: A custom HTTP tool that takes a deeply nested config object. The agent on GPT-5.4 Mini calls it with several fields silently dropped, and the API call fails or — worse — succeeds with partial data.

Why it hurts: Mini models have a known weakness with deeply nested tool parameters. The error mode is silent: the model produces a parameter object that looks right but is missing nested fields.

How to avoid: For agents that use complex custom tools, prefer GPT-5.4 over GPT-5.4 Mini. Or restructure the tool to take flat parameters instead of nested objects.
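
For illustration, the restructuring suggested above looks roughly like this. The fields are invented; what matters is the difference in shape between the two parameter schemas.

```python
# Invented custom-tool parameter schemas, for illustration only.

# Deeply nested: smaller models tend to silently drop the inner fields.
nested_schema = {
    "type": "object",
    "properties": {
        "config": {
            "type": "object",
            "properties": {
                "shipment": {
                    "type": "object",
                    "properties": {
                        "carrier": {"type": "string"},
                        "options": {
                            "type": "object",
                            "properties": {
                                "signature_required": {"type": "boolean"},
                                "insurance_amount": {"type": "number"},
                            },
                        },
                    },
                },
            },
        },
    },
}

# Flat: every field is top-level, so each one can be marked required and a
# dropped field becomes a visible validation error instead of a partial write.
flat_schema = {
    "type": "object",
    "properties": {
        "carrier": {"type": "string"},
        "signature_required": {"type": "boolean"},
        "insurance_amount": {"type": "number"},
    },
    "required": ["carrier", "signature_required", "insurance_amount"],
}
```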

Closed Tickets Get Touched

What it looks like: Agent calls zendesk_update_ticket_status or zendesk_add_public_reply on a closed ticket. Zendesk rejects the call with a vague error. The agent retries or gives up confused.

Why it hurts: Closed tickets in Zendesk are immutable. Any write fails. The agent wastes calls and produces no useful action.

How to avoid: Two layers of defense. First, in instructions: "Never close a ticket. Mark as solved instead, and check the ticket status before any write — if it is already closed, do nothing." Second, narrow your trigger conditions to exclude closed-ticket events.
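
If any of your own automation sits around the agent, the same status check can be enforced in code as a third layer. Below is a minimal sketch against the standard Zendesk Tickets API; the subdomain, credentials, and ticket ID are placeholders.

```python
import requests

# Placeholders: substitute your own Zendesk subdomain and API token.
SUBDOMAIN = "yourcompany"
AUTH = ("agent@yourcompany.com/token", "YOUR_API_TOKEN")

def is_closed(ticket_id: int) -> bool:
    """Check the ticket status via the Zendesk Tickets API before any write."""
    resp = requests.get(
        f"https://{SUBDOMAIN}.zendesk.com/api/v2/tickets/{ticket_id}.json",
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["ticket"]["status"] == "closed"

# Example usage (fails until the placeholders are replaced):
if is_closed(12345):
    print("Ticket is closed and immutable; skip the write entirely.")
```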

Hallucinated IDs

What it looks like: The agent calls zendesk_assign_ticket with an assignee_id the model invented. The call fails with a 404.

Why it hurts: The model has no way to know IDs unless you tell it. When the instructions say "assign to Sarah" and there is no zendesk_search_users tool available, the model will sometimes invent an ID rather than admit it cannot do the task.

How to avoid: Always pair ID-using tools with their lookup counterparts. zendesk_assign_ticket needs zendesk_search_users and zendesk_list_groups. zendesk_update_ticket_fields needs zendesk_get_ticket_fields. Make the dependency explicit in instructions: "Always look up IDs before using them. Never guess."
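
Sketched against the standard Zendesk REST API, the lookup-before-use pattern looks like this; the agent does the equivalent with zendesk_search_users followed by zendesk_assign_ticket. The subdomain and credentials are placeholders.

```python
import requests

SUBDOMAIN = "yourcompany"                                 # placeholder
AUTH = ("agent@yourcompany.com/token", "YOUR_API_TOKEN")  # placeholder
BASE = f"https://{SUBDOMAIN}.zendesk.com/api/v2"

def find_assignee_id(name: str) -> int | None:
    """Look the ID up; never let the model (or your code) guess it."""
    resp = requests.get(f"{BASE}/users/search.json", params={"query": name}, auth=AUTH, timeout=10)
    resp.raise_for_status()
    users = resp.json()["users"]
    return users[0]["id"] if users else None

def assign_ticket(ticket_id: int, assignee_name: str) -> None:
    assignee_id = find_assignee_id(assignee_name)
    if assignee_id is None:
        raise ValueError(f"No Zendesk user matches {assignee_name!r}; refusing to guess an ID")
    resp = requests.put(
        f"{BASE}/tickets/{ticket_id}.json",
        json={"ticket": {"assignee_id": assignee_id}},
        auth=AUTH,
        timeout=10,
    )
    resp.raise_for_status()

# Example usage: assign_ticket(12345, "Sarah")
```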

Tag Pollution

What it looks like: Agent generates tags from ticket content (e.g., the customer's name, a product version string, a free-text observation). Months later your tag namespace has thousands of one-off tags no one uses.

Why it hurts: Tags are a global namespace in Zendesk. Polluting it makes filtering harder and the tag UI sluggish.

How to avoid: Constrain tag-writes to a fixed vocabulary. "When tagging, use only one of the following tags: billing, technical, account, shipping, refund-request, escalation. Never invent new tags."
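
If the agent's tags pass through any automation of your own on the way to Zendesk, the same vocabulary can be enforced there as a backstop. A minimal sketch:

```python
# The fixed vocabulary, mirroring the one allowed in the instructions.
ALLOWED_TAGS = {"billing", "technical", "account", "shipping", "refund-request", "escalation"}

def sanitise_tags(proposed: list[str]) -> list[str]:
    """Drop anything outside the fixed vocabulary so one-off tags never reach Zendesk."""
    dropped = [t for t in proposed if t not in ALLOWED_TAGS]
    if dropped:
        print(f"Discarded invented tags: {dropped}")
    return [t for t in proposed if t in ALLOWED_TAGS]

print(sanitise_tags(["billing", "jane-doe-v2.3.1", "refund-request"]))
# Discarded invented tags: ['jane-doe-v2.3.1']
# ['billing', 'refund-request']
```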

Confirmation Gates Bypassed By Mistake

What it looks like: A write tool that should require confirmation gets enabled on an autonomous trigger without anyone realising the agent will now perform that action without human approval.

Why it hurts: The chat-mode confirmation gate is the safety net that catches bad writes during testing. In autonomous mode, that gate does not exist. Tools that you would never approve in chat will fire automatically.

How to avoid: Before enabling a trigger, look at every write tool the agent has and ask: "Am I OK with this firing without approval?" If no, remove the tool or stay in chat mode.

Conflicting Multi-Instance Configurations

What it looks like: Two Zendesk instances connected. Agent has tools from both instances. Instructions don't specify which one to use. Agent picks the wrong one.

Why it hurts: Agent reads from the production instance, writes to the sandbox. Or vice versa. Either way, the action ends up in the wrong place.

How to avoid: Be explicit in instructions about which instance the agent uses. Double-check the assigned tools list. If an agent only needs one instance, only assign tools from that instance.

Connector Auth Failures Going Unnoticed

What it looks like: A Zendesk OAuth token expires or is revoked. Macha auto-deactivates the agents using it. The team does not check email, does not notice the deactivation. Tickets pile up unrouted for days.

Why it hurts: Auto-deactivation is the right behavior — keeping a broken connector running would be worse — but if no one watches the inbox, the silence looks like everything is fine.

How to avoid: Make sure at least one admin actually reads the auth-failure emails. If your team is at risk of missing them, set up a Slack notification or a separate forwarding rule. Reconnecting takes one minute; noticing the disconnect is the hard part.

Treating The Pre-Launch Checklist As Optional

What it looks like: "We don't need to run all those checks, this is just a small change." The small change ships, breaks something, and the post-mortem reveals the change wasn't actually small.

Why it hurts: The pre-launch checklist exists to catch the things you forgot. Skipping it because you "know what you are doing" is when the surprises happen.

How to avoid: Use the checklist on every launch, including small ones. It takes ten minutes. See the Pre-Launch Checklist page.
