Testing Best Practices
The four-stage testing pattern: chat first, then Test Run, then internal-notes-only mode for a full week, then gradual rollout. The single most important section in this guide.
The Single Most Important Section In This Guide
Every agent failure we have ever seen in production — every one — could have been caught in testing. The teams who avoid bad incidents are the ones who follow a deliberate testing pattern. This page is that pattern.
The pattern is four stages, in order. Each stage limits the blast radius of mistakes. Skipping stages does not save time; it shifts the cost of finding bugs from your test environment to your customers.
- Stage 1 — Chat with the agent. Manual conversation, manual confirmation of every write.
- Stage 2 — Test Run with real data. Simulated trigger fires, end-to-end run, no side effects on external systems.
- Stage 3 — Internal-notes-only mode. Live triggers, real volume, but every customer-facing action becomes a private internal note your team reviews.
- Stage 4 — Gradual rollout. Public-facing actions enabled, but on a narrow subset of traffic that expands as confidence grows.
Most agents take one to three weeks to graduate through all four stages. That sounds slow. It is much faster than the alternative: shipping bad agents, losing customer trust, and starting over.
Stage 1: Chat With The Agent
Before any trigger, before any test run, just chat with the agent. Open a conversation in the Macha dashboard and walk through the workflow manually. Paste in a real ticket ID. Ask the agent to triage it. Watch which tools it calls and what arguments it passes. Approve or reject each write action through the confirmation gate.
You will catch 80% of agent issues in the first ten chat conversations. Wrong tool calls. Missing tool calls. Misinterpreting custom field IDs. Giving customer-facing replies the wrong tone. None of these need a real autonomous run to surface.
What To Look For In Chat Mode
- Does the agent call the right tools, in the right order? If it skips zendesk_get_ticket and tries to act on a ticket it has not read, the instructions need to enforce that step.
- Are the tool arguments sensible? If you see ticket_id: 0 or assignee_id: 12345 when no such user exists, the agent is hallucinating IDs. Add an instruction requiring it to look up IDs first (see the sketch after this list).
- Does the agent confirm the right writes? The confirmation message should accurately reflect what is about to happen.
- Is the tone of customer-facing drafts correct? Read every draft as if you were the customer receiving it.
- Does the agent escalate appropriately? Try edge cases — tickets in other languages, tickets with no information, tickets that should be ignored. Verify the agent stops rather than improvising.
- How many tool-call rounds does each conversation take? If a simple workflow takes 8+ rounds, the instructions are probably under-specifying the steps.
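The "sensible arguments" check can be made systematic. Below is a minimal Python sketch of a pre-flight validator you might keep alongside your test notes; validate_args, TOOL_RULES, and KNOWN_AGENT_IDS are illustrative names, not Macha APIs.

```python
# Hypothetical pre-flight sanity check for tool arguments observed in
# chat mode. The rules and names are illustrative, not part of Macha.

KNOWN_AGENT_IDS = {10001, 10002}  # populate from a real user lookup

TOOL_RULES = {
    "zendesk_update_ticket": {
        "ticket_id": lambda v: isinstance(v, int) and v > 0,
        "assignee_id": lambda v: v is None or v in KNOWN_AGENT_IDS,
    },
}

def validate_args(tool_name: str, args: dict) -> list[str]:
    """Return human-readable problems; an empty list means the args pass."""
    problems = []
    for field, check in TOOL_RULES.get(tool_name, {}).items():
        if field in args and not check(args[field]):
            problems.append(f"{tool_name}: suspicious {field}={args[field]!r}")
    return problems

# Example: validate_args("zendesk_update_ticket", {"ticket_id": 0})
# -> ["zendesk_update_ticket: suspicious ticket_id=0"]
```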
What To Try In Chat Mode
Run at least ten conversations covering this matrix (also sketched as a trackable checklist after the list):
- The happy-path workflow on a typical ticket.
- The same workflow on a ticket missing key information (no order ID, no customer email).
- A ticket that should be escalated rather than handled.
- A ticket in a language the agent should not respond to.
- A ticket with an attachment (image, PDF, audio).
- A ticket where the obvious tool would do the wrong thing — e.g. a ticket already marked solved.
- A ticket from a customer with unusual formatting or non-standard punctuation.
- A ticket asking for something the agent is forbidden to do (refund, pricing quote).
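If it helps to track outcomes per case, the matrix can live as a small data structure. A sketch; the expected outcomes are paraphrases of the list above, not Macha-defined statuses:

```python
# The chat-mode test matrix as a trackable checklist.

CHAT_TEST_MATRIX = [
    ("happy path, typical ticket",           "handled end-to-end"),
    ("missing order ID / customer email",    "asks or escalates"),
    ("should be escalated, not handled",     "escalates"),
    ("language the agent should not answer", "stops"),
    ("attachment: image, PDF, or audio",     "handled or escalates"),
    ("ticket already marked solved",         "stops rather than acting"),
    ("unusual formatting or punctuation",    "handled normally"),
    ("forbidden request (refund, quote)",    "refuses and escalates"),
]

def readiness(passed: set[str]) -> str:
    """passed holds the case names the agent handled correctly."""
    failed = [case for case, _ in CHAT_TEST_MATRIX if case not in passed]
    return "ready for Stage 2" if not failed else f"revise instructions; failed: {failed}"
```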
The point is not to pass every test — it is to find the failure modes before they ship. An agent that fails three of the eight test cases in chat mode is not ready for Stage 2; it is ready for an instruction revision.
Stage 2: Test Run With Real Data
Once the agent behaves well in chat, use the Test Run feature on the agent's configuration page. Test Run lets you simulate a trigger by feeding the agent a real entity (a Zendesk ticket, a Slack message, etc.) without actually firing the trigger: the agent reads real data, but no writes land on external systems.
The agent runs end-to-end as it would autonomously, but you see the result before any side effects ship. The key difference from Stage 1: confirmation gates are bypassed (just like they would be in real autonomous mode). This is what lets you see exactly how the agent will behave when running on its own.
Building A Test Suite
Run at least ten Test Run cases covering the variety of inputs you expect — easy cases, hard cases, edge cases. The same matrix from Stage 1 applies, but now you are also checking:
- Does the agent complete the full workflow without human prompting? No more "and then I asked it to do X" — the autonomous run has to do X on its own.
- Are the side effects what you intended? Check the writes the agent prepared for Zendesk, Slack, and whatever other external systems it targeted; you see them in the run result before anything ships.
- How long does a typical run take? If most runs are 30+ seconds, you may have an instruction that causes excessive tool-call rounds.
- Are the error paths handled? Force a failure (a malformed ticket, an unauthenticated connector) and verify the agent fails gracefully rather than retrying forever.
Save the inputs that exposed bugs. Re-run them after every meaningful instruction change. This is the lightweight version of a regression test suite.
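A sketch of what that lightweight suite can look like, assuming a run_test_case stand-in for however you kick off a Test Run (UI, API, or by hand); it is not a documented Macha function, and the ticket IDs are placeholders:

```python
# Re-run saved Test Run inputs after any instruction, model, or tool
# change. run_test_case is a placeholder for your Test Run entry point.

SAVED_CASES = ["ticket-48211", "ticket-48377", "ticket-49002"]  # inputs that exposed bugs

def rerun_saved_cases(run_test_case) -> None:
    for entity_id in SAVED_CASES:
        result = run_test_case(entity_id)  # assumed: dict of run metadata
        ok = result.get("completed") and not result.get("errors")
        rounds = result.get("tool_rounds", "?")
        print(f"{entity_id}: {'ok' if ok else 'REVIEW'} ({rounds} tool rounds)")
```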
Stage 3: Internal-Notes-Only Mode
This is the most important pattern in this entire guide. Before you let an agent reply to customers autonomously, run it in internal-notes-only mode for a full week.
How It Works
Configure the agent with the same tool set you plan to ship — except swap any customer-facing write tools for their internal-only equivalents. For Zendesk reply agents, the swap is:
- Remove zendesk_add_public_reply
- Add zendesk_add_internal_note
Update the instructions to match: "Instead of sending a public reply, post your draft response as an internal note prefixed with [AI DRAFT] so the human agent can review and post it manually. Do not send anything to the customer."
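Expressed as configuration, the swap is a one-line change. A sketch in Python; only zendesk_add_public_reply and zendesk_add_internal_note come from the steps above, and the list shape and other tool names are assumptions, not Macha's actual schema:

```python
# Stage 3 tool set: identical to production except for the final write.
# zendesk_get_ticket and the structure here are illustrative.

PRODUCTION_TOOLS = [
    "zendesk_get_ticket",
    "zendesk_add_public_reply",   # customer-facing write
]

STAGE_3_TOOLS = [
    "zendesk_get_ticket",
    "zendesk_add_internal_note",  # same workflow, drafts stay internal
]

DRAFT_PREFIX = "[AI DRAFT]"  # matches the instruction so reviewers spot drafts
```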
Now enable the trigger. The agent runs on every matching ticket exactly as it would in production — same model, same remaining tools, same reasoning, same trigger conditions — but every customer-facing action becomes a private note that only your team sees. Your support agents see the AI's draft, edit it as needed, and send it themselves. No customer ever sees a bad AI reply.
What This Pattern Gives You
- A real-world quality signal. You see how the agent performs on actual tickets, not curated test cases. Real tickets have real edge cases that you cannot dream up in advance.
- Human oversight at scale. Every reply is reviewed before it ships, even at high volume. The agent does the bulk of the drafting work; the human does the judgment work.
- A bridge for skeptical teams. Support managers who are nervous about AI replies often warm up once they see the drafts are good and the team can edit them. This is how you build organisational trust before you remove the human review.
- A reversible bet. If the agent is bad, you discover it without burning trust with customers. The worst case in internal-notes-only mode is "the team had to write more replies from scratch this week" — which is recoverable.
- A faster iteration loop. When the team flags a bad draft, you can correlate it with the exact instruction or tool issue, fix it, and the next batch of drafts improves. This loop runs in days, not weeks.
How Long To Run It
Minimum: one full week, including a weekend. The character of weekend tickets differs from weekday tickets: different topics, different volume, different urgency. An agent that does well Monday-Friday but mishandles Sunday tickets needs to be caught before going public.
Typical: one to two weeks. You want enough volume to see the long tail. If your agent matches 100 tickets a week, two weeks gives you 200 drafts to review — that is enough signal.
Long: until the team is bored of approving the drafts. When your support team is consistently saying "yeah this draft is fine, just sending it," that is the signal that you can graduate to public-reply mode.
What To Look For In The Drafts
- Accuracy. Are the facts in the draft right? Does it cite the correct help centre article? Did it look up the right order?
- Tone. Does it sound like your team? Does it match the customer's level of formality?
- Completeness. Does it answer the customer's actual question, or only part of it?
- Restraint. Does it avoid promising things you cannot deliver (refunds, escalation guarantees, deadlines)?
- Format. Does it use the right greeting, sign-off, line breaks, links?
- Edge cases. When the customer's situation is unusual, does the agent escalate or improvise? Improvisation is bad; escalation is good.
Track your team's approval rate. "How often did the team send the draft as-is?" is a hard quality metric. If it climbs above ~85% over two weeks of stable drafts, the agent is ready to ship public replies.
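As a concrete sketch of the metric, assuming a simple review log your team keeps (the entries and field names are illustrative):

```python
# Approval rate from a draft review log: the share of drafts the team
# sent unedited.

review_log = [
    {"ticket": 48211, "week": 1, "sent_as_is": True},
    {"ticket": 48214, "week": 1, "sent_as_is": False},
    # ... one entry per reviewed [AI DRAFT] note
]

def approval_rate(log, week=None) -> float:
    rows = [e for e in log if week is None or e["week"] == week]
    return sum(e["sent_as_is"] for e in rows) / len(rows)

print(f"overall: {approval_rate(review_log):.0%}")  # graduation target: above ~85%
```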
The Graduation Criteria
Three signals that you are ready to flip from internal-notes to public-reply (combined into a single check after the list):
- Approval rate above 85% over two consecutive weeks.
- No "would have sent something I would not have sent myself" cases in the last week.
- The team trusts the agent enough to want to graduate it. (This is a real criterion. If the team is not sure yet, give it another week.)
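The three signals folded into one go / no-go function. A sketch; the 0.85 threshold and two-week window come from the criteria above, everything else is illustrative:

```python
# Go / no-go for graduating from internal notes to public replies.

def ready_to_graduate(weekly_approval_rates: list[float],
                      bad_draft_in_last_week: bool,
                      team_wants_to_graduate: bool) -> bool:
    two_good_weeks = (len(weekly_approval_rates) >= 2
                      and all(r > 0.85 for r in weekly_approval_rates[-2:]))
    return two_good_weeks and not bad_draft_in_last_week and team_wants_to_graduate
```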
When all three are true, swap zendesk_add_internal_note back to zendesk_add_public_reply, update the instruction to "post the reply as a public reply," and move into Stage 4. Because Stage 3 already proved the agent at full volume, you can start Stage 4 wider and expand faster than you otherwise would; still expand in steps, as the next stage describes.
Variations Of This Pattern
The same idea applies beyond Zendesk replies:
- Slack agents: have the agent draft replies in a private channel before sending them in the actual channel.
- Email agents: have the agent compose drafts in a "drafts" folder before sending.
- Refund agents (Stripe / Shopify): have the agent post the proposed refund as an internal note in the corresponding Zendesk ticket, instead of actually creating the refund. A human reviews and creates the refund manually.
- Routing agents: have the agent recommend a routing decision in an internal note, instead of actually reassigning. The team sees the recommendation, decides whether to act on it.
The principle is the same in every case: keep the agent doing all the upstream work (reasoning, tool calls, drafting) but redirect the final irreversible action through a human until you trust it.
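The pattern reduces to one branch in code. A minimal sketch; send_public_reply and post_internal_note are stand-ins for whichever irreversible and internal-only actions your agent uses:

```python
# Redirecting the final irreversible action through a human. Upstream
# work (reasoning, tool calls, drafting) is identical in both branches.

def send_public_reply(body: str) -> None:
    ...  # the real customer-facing call, used once the agent is trusted

def post_internal_note(body: str) -> None:
    ...  # internal-only destination the team reviews

def finish_run(draft: str, trusted: bool) -> None:
    if trusted:
        send_public_reply(draft)
    else:
        post_internal_note(f"[AI DRAFT] {draft}")
```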
The Internal-Notes Test In One Sentence
If your agent's drafts would have been good enough to send unedited at least 85% of the time over a full week of real tickets, ship the public-reply version. If not, you have a list of exactly what to fix.
Stage 4: Gradual Rollout
When you do flip to customer-facing actions, do not switch on every ticket at once. Use trigger conditions to limit the agent to a subset — a single brand, a single tag, tickets from a specific email domain. Watch it for a few days. Expand to the next subset. Repeat.
A typical rollout schedule for a reply agent:
- Days 1-3: One brand only. ~10% of total ticket volume.
- Days 4-7: Add a second brand. ~25% of volume.
- Days 8-14: Open to all eligible tags across all brands. Full eligible volume.
- Day 15+: Stable production. Ongoing monitoring at lower frequency.
The schedule matters less than the principle: monitor between expansions. Each expansion is a chance to catch a regression you would have missed at lower volume. Skip the monitoring step and the schedule is just an illusion of caution.
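One way to keep the schedule honest is to write the phases down as data and refuse to advance without a review. A sketch; the brand names, tag, and condition shape are illustrative:

```python
# Rollout phases as data. Expand only after reviewing the current phase;
# each expansion can surface regressions invisible at lower volume.

ROLLOUT_PHASES = [
    {"days": "1-3",  "conditions": {"brand": ["brand-a"]},            "volume": "~10%"},
    {"days": "4-7",  "conditions": {"brand": ["brand-a", "brand-b"]}, "volume": "~25%"},
    {"days": "8-14", "conditions": {"tags": "all eligible"},          "volume": "full"},
]

def next_phase(current: int, current_phase_reviewed: bool) -> int:
    """Advance only after a human has reviewed the current phase's runs."""
    if not current_phase_reviewed:
        raise RuntimeError("review the current phase before expanding")
    return min(current + 1, len(ROLLOUT_PHASES) - 1)
```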
Maintaining A Test Suite
Save your Stage 1 and Stage 2 test cases. After every meaningful change to instructions, model, or tool set, re-run the saved cases. This is your regression suite — it catches the case where a fix to one bug introduces a different bug somewhere else.
You do not need a fancy framework. A shared doc with "the ten cases we test before any agent change" is enough. The discipline is more valuable than the tooling.
What Not To Skip
Common shortcut paths that do not work:
- "We tested it in chat, that should be enough." Chat tests confirmation behavior; it does not test autonomous behavior. Run Stage 2.
- "We did Test Run on five cases, looks good." Five cases is not enough variety to catch the long tail. Aim for ten across the input matrix.
- "Internal notes mode takes too long, let's just go live with narrow conditions." Narrow conditions limit blast radius but do not catch quality issues. Internal notes catch quality issues. Both are needed.
- "The team agrees the agent is good, let's skip Stage 3." Vibes are not a test. Run the week.
- "We have a regression suite, we don't need to retest manually." A regression suite catches what you knew to test for. New failure modes come from things you did not anticipate. Both matter.
The four-stage pattern is slow on purpose. The cost of being slow is measured in days. The cost of being fast and wrong is measured in customer trust. There is no contest.