Iterating on Live Agents
Version history, monitoring drift, A/B testing model and prompt changes, and knowing when to roll back rather than patch forward.
Agents Drift
An agent that worked perfectly at launch can drift over time. The model's behavior shifts subtly with provider updates. The underlying tickets evolve as your product changes. New customer segments arrive with patterns the agent was not tested on. The instructions you wrote six months ago no longer match what you actually want.
Iterating well — versioning instructions, monitoring quality, knowing when to roll back — is how good agents stay good in production. This page is about the maintenance discipline.
Version History
Every change you make to an agent's instructions is saved as a version. You can view the history, diff any two versions side by side, and restore a previous version with one click.
Use this aggressively:
- Before making any non-trivial instruction change, skim the version history to see what the last known-good version looked like.
- When you notice the agent's quality has degraded after a recent edit, do not patch forward from the degraded state. Roll back, then reattempt the change more carefully.
- When you ship a meaningful change, write a one-line summary in the version description so future-you remembers what changed and why.
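As a mental model, the whole mechanism fits in a few lines. Below is a minimal sketch assuming a simple in-memory store; `VersionEntry`, `VersionHistory`, and their methods are illustrative names for the discipline, not the product's actual API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
import difflib

@dataclass
class VersionEntry:
    instructions: str
    summary: str  # one line: what changed and why
    saved_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class VersionHistory:
    """Append-only history with diff and one-step rollback (illustrative)."""

    def __init__(self) -> None:
        self._entries: list[VersionEntry] = []

    def save(self, instructions: str, summary: str) -> None:
        self._entries.append(VersionEntry(instructions, summary))

    def diff_latest(self) -> str:
        """Unified diff between the two most recent versions."""
        a, b = self._entries[-2], self._entries[-1]
        return "\n".join(difflib.unified_diff(
            a.instructions.splitlines(), b.instructions.splitlines(),
            fromfile=a.summary, tofile=b.summary, lineterm=""))

    def rollback(self) -> VersionEntry:
        """Discard the latest version and return the previous one."""
        self._entries.pop()
        return self._entries[-1]
```

The one-line `summary` field is the part people skip and later regret: a diff tells you what changed, but only the summary tells you why.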
Small, Frequent Changes Beat Big, Infrequent Ones
A good iteration rhythm:
- Notice a specific failure mode.
- Make one targeted instruction change.
- Run five Test Run cases that include the failure mode.
- Ship to internal-notes mode (or chat mode, if the agent is not autonomous yet) for a day.
- Check the drafts. Decide whether to keep or roll back.
Small changes are easier to attribute. If the agent's behavior shifts after one change, you know exactly what caused it. If you bundle five changes into one edit and the agent gets worse, you cannot tell which change broke it without unpicking everything.
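The Test Run step in that rhythm is easy to keep honest with a tiny replay script. A minimal sketch, assuming a hypothetical `run_agent` call that invokes your agent however you normally do; the cases and expected substrings are made up for illustration:

```python
# Replay saved failure-mode cases after each single instruction change.
FAILURE_CASES = [
    # (ticket text, substring the reply must contain, substring it must not)
    ("I was double charged last month", "refund", None),
    ("Cancel my account immediately", "escalat", "discount"),
]

def run_agent(ticket_text: str) -> str:
    # Stand-in for however you invoke the agent (Test Run, API, etc.).
    raise NotImplementedError("call your agent here")

def regression_check() -> bool:
    ok = True
    for ticket, must_have, must_not in FAILURE_CASES:
        reply = run_agent(ticket).lower()
        if must_have and must_have not in reply:
            print(f"FAIL (missing {must_have!r}): {ticket}")
            ok = False
        if must_not and must_not in reply:
            print(f"FAIL (contains {must_not!r}): {ticket}")
            ok = False
    return ok
```

Each time you find a new failure mode, add a case. The list becomes a cheap regression suite that grows with the agent.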
A/B Testing Models
Switching models is the single biggest change you can make to an agent. The cheapest way to validate a model change is to A/B test on real traffic.
The pattern: duplicate the agent. Switch only the model on the duplicate. Use trigger conditions to send a portion of matching events to each version (e.g. by tag, brand, or even a percentage filter if your trigger source supports it; see the sketch after this list). Run for a week. Compare:
- Approval rate (if you are still in internal-notes mode).
- Ticket-resolution rate (did the conversation actually end after the AI response?).
- Credit cost per conversation.
- Latency (median time from trigger to response).
- Error rate (failed tool calls, malformed responses).
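A minimal sketch of the split and the tally. Hashing the conversation ID keeps each conversation pinned to one variant for the whole test; every field name on the log records (`draft_approved`, `resolved`, `credits`, `latency_s`, `had_error`) is an assumption about your own export, not a real schema:

```python
import hashlib
from statistics import median

def variant_for(conversation_id: str, pct_b: int = 50) -> str:
    """Deterministically assign a conversation to variant A or B."""
    bucket = int(hashlib.sha256(conversation_id.encode()).hexdigest(), 16) % 100
    return "B" if bucket < pct_b else "A"

def compare(conversations: list[dict]) -> None:
    for variant in ("A", "B"):
        convs = [c for c in conversations if c["variant"] == variant]
        if not convs:
            print(f"variant {variant}: no traffic")
            continue
        n = len(convs)
        print(f"variant {variant} (n={n})")
        print("  approval rate:   ", sum(c["draft_approved"] for c in convs) / n)
        print("  resolution rate: ", sum(c["resolved"] for c in convs) / n)
        print("  credits/conv:    ", sum(c["credits"] for c in convs) / n)
        print("  median latency s:", median(c["latency_s"] for c in convs))
        print("  error rate:      ", sum(c["had_error"] for c in convs) / n)
```

The deterministic hash matters: random assignment per event would let a single conversation flip between variants mid-thread, which contaminates both samples.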
If the cheaper model wins on quality and cost, switch. If the more expensive model wins on quality enough to justify the cost, keep paying.
A/B Testing Instruction Changes
The same pattern works for prompt changes. Duplicate the agent, change the prompt on the duplicate, route a portion of traffic, and compare. This is more rigorous than a gut-feel "I think this prompt is better" judgment.
Common things worth A/B testing:
- A new escape hatch you added (does it fire when it should? see the sketch after this list).
- A tone change ("be friendlier" vs the previous formal tone).
- A workflow rewrite (numbered steps in a different order).
- A new tool added to the toolset (does the agent actually use it well?).
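For the escape-hatch item, the check can be as small as a fire-rate comparison. A sketch under the same assumptions as before, with `matches_hatch_condition` and `hatch_fired` as hypothetical fields you would derive from your logs:

```python
def escape_hatch_fire_rate(convs: list[dict], variant: str) -> float:
    """Among conversations that should trigger the hatch, how often did it?"""
    eligible = [c for c in convs
                if c["variant"] == variant and c["matches_hatch_condition"]]
    if not eligible:
        return 0.0
    return sum(c["hatch_fired"] for c in eligible) / len(eligible)
```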
Monitoring Drift
Set a recurring calendar event — weekly for the first month after launch, monthly afterwards — to spot-check the agent's recent conversations. Look at five random conversations from the last week and ask:
- Would I have responded the same way?
- Is the tone still on-brand?
- Are the tool calls efficient (no obvious wasted rounds)?
- Has the credit-per-conversation cost crept up?
- Are there new failure modes that did not exist at launch?
This is not a substitute for incident monitoring (real-time alerts when things break). It is the slower discipline of catching the things that have gone subtly wrong without breaking — the agent that started giving 4-paragraph replies when 2 used to be normal, or started escalating fewer cases than it should.
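The spot-check itself can be semi-automated. A minimal sketch, assuming a conversation export with `created_at`, `url`, `reply`, and `escalated` fields (all hypothetical names); it samples five recent conversations for manual review and prints two cheap drift signals:

```python
import random
from datetime import datetime, timedelta, timezone
from statistics import median

def weekly_spot_check(conversations: list[dict], k: int = 5) -> None:
    cutoff = datetime.now(timezone.utc) - timedelta(days=7)
    recent = [c for c in conversations if c["created_at"] >= cutoff]

    # Manual review: open each sampled conversation, apply the checklist above.
    for conv in random.sample(recent, min(k, len(recent))):
        print(conv["url"])

    # Two cheap signals for "subtly wrong without breaking":
    if recent:
        print("median reply paragraphs:",
              median(c["reply"].count("\n\n") + 1 for c in recent))
        print("escalation rate:",
              sum(c["escalated"] for c in recent) / len(recent))
```

Tracked week over week, those two numbers catch exactly the slow drift described above: replies growing longer, escalations quietly falling.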
The "Roll Back, Then Investigate" Default
When an agent's quality drops sharply and it is not obvious why, the right first move is almost always to roll back to the last known-good version. Investigate from a stable state, not while the agent continues to degrade.
The opposite instinct (patching the live agent) almost always makes things worse. The patch addresses the symptom you noticed; the underlying cause produces a new symptom; you patch that too, and the cycle repeats. Two days later you have a heavily patched instruction set and no idea what the original problem was.
Roll back first. The version history makes this safe. Investigate from the rollback. Make the real fix on a duplicate. A/B test before re-shipping.
Knowing When To Retire An Agent
Some agents reach a point where the workflow they were designed for has changed enough that the agent is no longer the right tool. Signs:
- The instructions have been patched so many times that no one fully understands the current behavior.
- The agent's approval rate has been declining for weeks despite repeated tweaks.
- The tickets it matches have shifted in character so much that the original purpose no longer applies.
- You find yourself writing more escape hatches than core workflow.
When this happens, retiring the agent and rebuilding from scratch is faster than patching. Take what worked from the existing version (the helpful escape hatches, the validated tone), fold it into the new agent's instructions, and start fresh on everything else. Sometimes the best refactor is a rewrite.
Logging Changes For The Team
When multiple people maintain agents, instruction changes from one person can confuse another. Two practices that help:
- One owner per agent. Even if anyone can edit, designate one person responsible for the agent's overall direction. Changes from others go through that person.
- Public change notes. When you ship an instruction change, post it in a shared Slack channel or doc with one sentence on what changed and why. Future you will thank current you.
This sounds heavyweight. It is not — it takes 30 seconds per change and prevents the "wait, why did the agent suddenly start escalating everything?" detective work two weeks later.
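The change note can be a one-liner posted from wherever you edit. A sketch using a Slack incoming webhook; the URL is a placeholder for your own webhook and the agent name is made up:

```python
import requests

# Placeholder: replace with your own Slack incoming webhook URL.
WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

def log_change(agent: str, note: str) -> None:
    """Post a one-line change note to the shared channel."""
    requests.post(WEBHOOK_URL, json={"text": f"[{agent}] {note}"}, timeout=10)

log_change("refund-triage", "Added escape hatch for chargebacks; softened tone.")
```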