Agent Evaluation — Evaluate How Your Agents Perform

Two layers of evaluation are built into every Macha workspace. Agent Auto-Eval grades every conversation automatically against the agent's instructions (free on every plan, with caps). Manual Evaluations let you define a custom rubric and run it across a chosen scope (Professional and above).

What Agent Evaluation is

Agent Evaluation is a built-in way to see how your agents are actually performing — not whether they ran, but whether they did the right thing. An AI judge grades every trigger-fired conversation an agent handles, against the agent's own live instructions, and gives you a clear view of how often the agent followed those instructions, where it slipped, and a short reason for each verdict so you can audit the call without re-reading the whole conversation.

It lives under the Agent Evaluation tab on every agent's detail page, alongside Configuration, Try it, and History.

Two flavours: Auto-Eval and Evaluations you create

Every Macha workspace gets two layers of evaluation, doing different jobs:

Agent Auto-Eval is turned on by default for every customer on every plan. The moment you create an agent and wire it to a trigger, Macha starts grading every conversation that agent handles in Agent mode, automatically, against the agent's own live instructions. You don't configure anything, and you don't pay credits for it. Grading runs on Trial, Starter, Professional, and Enterprise alike, subject to the per-agent cap listed under Plan availability.
Agent Evaluations you create are covered in the rest of this page. This is the manual, configurable flow you reach for when you've changed the prompt, added a sub-agent, or want a custom rubric scored against a specific historical window. You pick the scope, define the checklist (or let Macha build one from the agent's instructions), pick the judge model, and pay credits per evaluated conversation. Available on Professional and Enterprise.

Auto-Eval and manual Evaluations share the same underlying machinery, results table, and report view. The difference is who configures it and whether it bills credits.

About Agent Auto-Eval

Auto-Eval was built for one job: tell you, without any setup, whether your agents are doing what their instructions say they should. Every autonomous (trigger-fired) conversation an agent finishes fires a single grading pass on a hardcoded GPT-5 judge model. The judge fills in five fields:

Instructions followed — Yes / Partially / No.
Why — a short justification citing the specific rules the agent followed or missed.
Resolution — Resolved, Partially resolved, Unresolved, Escalated to human, or No resolution needed.
Customer sentiment — Positive, Neutral, Negative, or Frustrated.
Category — a short topic label like Refunds, Login issues, or Shipping delay so you can see what your agents are actually spending their time on.
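
If you post-process these verdicts outside Macha, the five fields can be pictured as a small record. This is only an illustrative sketch: the class and enum names are assumptions, not Macha's actual schema or API, and the sample values at the bottom are made up. Only the allowed options come from the list above.

```python
from dataclasses import dataclass
from enum import Enum


# Hypothetical names throughout; only the option labels come from this page.
class InstructionsFollowed(Enum):
    YES = "Yes"
    PARTIALLY = "Partially"
    NO = "No"


class Resolution(Enum):
    RESOLVED = "Resolved"
    PARTIALLY_RESOLVED = "Partially resolved"
    UNRESOLVED = "Unresolved"
    ESCALATED = "Escalated to human"
    NOT_NEEDED = "No resolution needed"


class Sentiment(Enum):
    POSITIVE = "Positive"
    NEUTRAL = "Neutral"
    NEGATIVE = "Negative"
    FRUSTRATED = "Frustrated"


@dataclass(frozen=True)
class AutoEvalVerdict:
    instructions_followed: InstructionsFollowed
    why: str       # short justification citing the rules followed or missed
    resolution: Resolution
    sentiment: Sentiment
    category: str  # free-form topic label, e.g. "Refunds"


# Made-up sample record, for illustration only.
example = AutoEvalVerdict(
    instructions_followed=InstructionsFollowed.PARTIALLY,
    why="Issued the refund but skipped the required order-ID check.",
    resolution=Resolution.RESOLVED,
    sentiment=Sentiment.NEUTRAL,
    category="Refunds",
)
```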

Auto-Eval only grades Agent-mode conversations — the trigger-fired runs where the agent is acting on customer signals in production. Chat sessions in the dashboard, widget conversations, and test runs are excluded so the eval measures the behaviour customers actually deploy. The header on the Evaluation tab spells this out with an Agent mode only pill; if you switch the agent to Chat mode or deactivate it, the pill flips to Auto-evaluation paused so you know why nothing new is landing.

Auto-Eval Studies are locked from editing and deletion. You can't change the scope, the rubric, or the judge model. That's the trade-off for it being free: the rubric is opinionated and uniform, so every customer gets the same headline adherence number, computed the same way.

The dashboard

The Agent Evaluation tab renders four things at a glance:

Adherence score card. The current adherence percentage, the Yes / Partially / No breakdown, and how the score has moved versus the previous period.
Adherence-over-time chart. A daily trend line of adherence % with a dashed volume overlay showing how many conversations were judged each day. Hover any point for the exact numbers.
Notes. Every 50 graded conversations, an AI-written note lands with the dominant patterns from that batch — what worked, the failure modes clustered by root cause, and 2–4 concrete instruction tweaks that would lift the score. Older notes stay browsable through a View all notes modal so you can compare across batches and see whether a prompt change moved the needle.
Resolution and Customer sentiment breakdowns. Bar charts of the two outcome scores so you can see, at a glance, whether the agent actually solved problems and how customers felt at the end.
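
To make the adherence-over-time chart concrete, here is a minimal sketch of how a daily adherence percentage and a daily volume could be derived from graded conversations. It assumes the half-credit weighting for Partially that this page gives for the adherence score. The input shape, the function name, and the sample rows are hypothetical and do not describe Macha's internals.

```python
from collections import defaultdict
from datetime import date

# Hypothetical input: (day the conversation was graded, verdict) pairs.
graded = [
    (date(2025, 1, 6), "Yes"),
    (date(2025, 1, 6), "Partially"),
    (date(2025, 1, 7), "Yes"),
    (date(2025, 1, 7), "No"),
    (date(2025, 1, 7), "Yes"),
]

# Yes counts fully, Partially counts half, No counts nothing.
WEIGHTS = {"Yes": 1.0, "Partially": 0.5, "No": 0.0}


def daily_trend(rows):
    """Return {day: (adherence_percent, conversations_judged)}, sorted by day."""
    by_day = defaultdict(list)
    for day, verdict in rows:
        by_day[day].append(WEIGHTS[verdict])
    return {
        day: (100.0 * sum(weights) / len(weights), len(weights))
        for day, weights in sorted(by_day.items())
    }


trend = daily_trend(graded)
```

Each value pairs the trend-line point with the volume overlay for that day, which is the same pairing the chart shows on hover.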

Below the cards is the per-conversation results table — every graded conversation with its verdict, resolution, sentiment, category, the judge's reasoning, and a link into the conversation itself for a full audit.

Multi-turn tickets

When a customer replies to a ticket that's already been graded, and the agent handles that new turn, the new turn gets its own grade. You see how the agent held up each time it fired on a ticket, not just the first time.

The adherence score is computed the same way as in manual Evaluations: Yes count + ½ × Partially count, divided by the total. A 60% Yes, 40% Partially run scores 80%.
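
The formula is small enough to check by hand. The sketch below reproduces it in Python. The function name is an assumption, and raising an error on an empty run is a choice made for this example, not documented Macha behaviour.

```python
def adherence_score(yes: int, partially: int, no: int) -> float:
    """Return adherence as a percentage: (Yes + 0.5 * Partially) / total."""
    total = yes + partially + no
    if total == 0:
        # Assumption for this sketch: an empty run has no defined score.
        raise ValueError("no graded conversations")
    return 100.0 * (yes + 0.5 * partially) / total


# The worked example from the text: 60% Yes, 40% Partially.
print(adherence_score(yes=60, partially=40, no=0))  # 80.0
```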

When to use it

Three situations where Evaluation pays off:

You've changed the agent's prompt and want to know whether that change actually helped, hurt, or made no difference. Run the evaluation before and after — the percentage-followed score and per-conversation verdicts tell you immediately.
You're onboarding a new sub-agent into an existing routing flow and need to make sure the parent agent is handing off correctly. Evaluation checks the handoff messages, the routing decisions, and any forbidden actions in one pass.
You're rolling out an agent to a new team or use case and want a baseline quality score before going wider. Set the evaluation up once, run it weekly, and you have a quality trend line.

Evaluation is not a live moderator — it evaluates past conversations, not in-flight ones. For real-time guardrails, use the agent's own instructions or a confirmation gate on write tools.

How an Evaluation is shaped

Every Evaluation has four parts:

An agent — the one whose past conversations you want to evaluate. The Evaluation tab lives on the agent's detail page, so the agent is preselected; you can't accidentally run an evaluation across the wrong one.
A scope — which slice of that agent's conversations to evaluate. The Scope step shows a Filter conversations card where you set a date range and (optionally) restrict to conversations the agent ran on a specific AI model — useful when you've migrated the agent from one model to another and want to compare. By default the last 7 days are evaluated.
Which version of the AI's instructions to evaluate against — agents have version history, so you choose: the instructions that were live when each conversation actually ran (the honest "did this agent do its job at the time?" view), the AI instructions you have right now (the "would today's prompt have passed?" view), or a specific saved version (for A/B comparisons between prompt iterations).
A set of evaluation fields — the checklist of fields the AI judge fills in for every conversation. Two fields are added for you automatically and can't be removed: Instructions followed (Yes / Partially / No) and a short Why explanation. You add the rest of the checks — for example "Right tool called", "Tone was empathetic", "Refund granted only when policy allowed".
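
The four parts can be pictured as one configuration object. The sketch below is illustrative only: every class, attribute, and value name is hypothetical, the "as_run" default is an assumption, and none of it is Macha's API. It encodes two facts from the list above: the default scope is the last 7 days, and the two mandatory fields always lead the checklist.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from typing import Optional

# Hypothetical names; this is a picture of the four parts, not a real API.
MANDATORY_FIELDS = ("Instructions followed", "Why")


@dataclass
class Scope:
    end: date = field(default_factory=date.today)
    days: int = 7                       # default window: the last 7 days
    model_filter: Optional[str] = None  # optionally restrict to one AI model

    @property
    def start(self) -> date:
        return self.end - timedelta(days=self.days)


@dataclass
class EvaluationConfig:
    agent_id: str                       # preselected from the agent's page
    scope: Scope = field(default_factory=Scope)
    # "as_run", "latest", or a pinned version label such as "v12" (made up).
    # Which option is the real default is not stated; "as_run" is assumed.
    instructions_source: str = "as_run"
    custom_fields: list[str] = field(default_factory=list)

    def all_fields(self) -> list[str]:
        # The two mandatory fields come first and cannot be removed.
        return [*MANDATORY_FIELDS, *self.custom_fields]


config = EvaluationConfig(
    agent_id="agent-123",
    custom_fields=["Right tool called", "Tone was empathetic"],
)
print(config.all_fields())
```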

Creating an evaluation

Open any agent and click the Agent Evaluation tab. The first time you open it you'll land on a zero state explaining the feature with a primary Create your first evaluation button. After that it shows a table of the evaluations you've already run.

Click New evaluation (or Create your first evaluation) and you'll see a three-step wizard:

Scope — "What should we evaluate?" Pick the date range, optionally restrict by AI model used for the conversation, and choose which version of the AI's instructions the judge should evaluate against.
Extract — "What should the AI judge evaluate?" Define the checklist. Two paths here:
- Build with AI (recommended) — Macha reads this agent's instructions and proposes a set of evaluation fields automatically, one per rule it finds. You can edit, remove, or add to anything it proposes before running.
- Add manually — define each field yourself: field name, field type (Yes/No, Single choice, Multiple choice, Short text, Long text, Number), and per-field guidance telling the judge how to decide. Use this when you have a very specific rubric in mind.
Review — see what's about to be evaluated, get a cost estimate (credits per evaluated conversation, at the per-record rate of the judge model you picked), then start the run. If your filters return zero conversations, the Test and Run buttons are disabled with a one-click link back to widen the Scope step.

The wizard runs entirely inside the agent's tab — the URL stays on the agent page the whole time, so you don't lose context if you flip away mid-flow.

The two mandatory fields

Every Agent Evaluation includes two fields automatically:

Instructions followed — a Yes / Partially / No verdict on whether the agent followed its instructions on this conversation. Yes means every required step was carried out correctly; Partially means it mostly followed them but made a notable mistake (wrong tag, wrong template, skipped step); No means the agent took the wrong action or didn't follow the instructions at all.
Why — a one-to-two sentence explanation of the verdict, pointing to a specific moment in the conversation. This is the audit trail: a reviewer can scan the percentage score, click into a row, and immediately see why the AI judge reached that verdict without reading the whole conversation.

Both fields show up as the first two columns on the results grid, and the verdict drives the green/amber/red colour coding used everywhere else (the compliance pill on the table, the donut chart on the report, the headline percentage on the evaluation list).

Building the rest of the checklist with AI

When you pick Build with AI, Macha:

Reads the current instructions for the agent you picked (or the specific version you pinned).
Identifies the rules and constraints in those instructions — what the agent must do, what it must not do, how it must classify things.
Proposes one evaluation field per rule. Each field comes with a field name, the right field type, options where relevant, and judge guidance.
Streams the proposals into the wizard one at a time, so you can watch it work.

This usually produces a sensible 3–7 field checklist in under thirty seconds. You can edit any field before running, or delete it and add your own.

Running an evaluation and seeing results

Click Start run on the Review step. The run evaluates every conversation in your scope using the judge model you picked (defaults to GPT-5). You'll see a live "0 of N processed" counter and rows appearing in real time. Cancel mid-run if you want — credits already spent stay billed, but no new ones get charged.

Once it's done you get three views:

Results table — one row per conversation, one column per evaluation field. The Instructions followed column shows the Yes/Partially/No pill (green/amber/red); the Why column has the short reasoning; the rest of your custom fields fill in next to them. Click a row to open the full record drawer with the original conversation and every field's value.
Report view — click View report from the results page. You get a summary card per evaluation field. The Instructions followed card is rendered as a three-segment donut chart (green/amber/red) with a centre percentage and a clickable legend. Other fields show as bar lists.
Evaluation list — back on the agent's tab, the list view shows one row per evaluation with the headline % followed pill, the judge model, the status, the count of evaluated conversations, and the timestamp.

The headline percentage is computed as: Yes count + ½ × Partially count, divided by the total evaluated. Partial credit for partial work — so a run that's 60% Yes and 40% Partially scores 80% followed, not 60%.

Re-running and iterating

From the evaluations list you can hit Run again on any past evaluation. This reopens the same wizard with the original config loaded — you can keep the scope and just re-run against the latest conversations, or change the instructions-source from "as run" to "latest" to see whether your newer prompt would have done better on the same conversations.

Common iteration loops:

Weekly quality check — re-run the same evaluation with the scope set to the last 7 days, and watch the trend.
Before-and-after prompt change — run once against the old version (pin "instructions in effect at the time"), edit the agent's prompt, run again against "latest" over the same date range, compare the two pills.
Regression catch — run on a fixed historical window after every prompt change to make sure you didn't break something the old prompt got right.

What you don't have to set up

A few things people often ask about — Agent Evaluation handles them automatically:

The agent is already locked in. You don't pick an agent on the Scope step — you're evaluating the agent you're already looking at.
The Instructions field on Step 2 (Extract) is hidden. The judge already evaluates against the agent's actual configured instructions via the Instructions followed field, so an extra "guidance for the judge" textbox would just be noise.
Sub-agent runs are evaluated as part of the parent conversation. The judge sees the full delegation tree — every tool call, every sub-agent reply — so a routing agent gets evaluated on whether its handoff was correct, not just on whether it called the right tool.
The judge sees the full conversation transcript by default. There's no "What the AI sees per record" picker on the Scope step for evaluations — instructions, transcript, tool calls, and metadata are all in scope, capped at the model's context budget.

Plan availability

Agent Auto-Eval is included on every plan, including Trial, at no credit cost. Cap: 50 conversations per agent on every plan.

Manual Agent Evaluations (the wizard-driven flow described above) are available on the Professional and Enterprise plans. Each evaluated conversation is billed at the judge model's per-record rate (e.g. 2 credits/record on GPT-5). You see the estimate before starting the run.
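
As a rough sanity check on the estimate, the arithmetic is the number of conversations in scope multiplied by the judge model's per-record rate. The sketch below assumes that simple multiplication. The function name and the 120-conversation scope are made up, and the figure the wizard shows before the run is the one that counts.

```python
def estimate_credits(conversations: int, credits_per_record: int) -> int:
    """Estimated credits for a run: conversations in scope times the per-record rate."""
    if conversations < 0 or credits_per_record < 0:
        raise ValueError("inputs must be non-negative")
    return conversations * credits_per_record


# 2 credits/record is the GPT-5 example rate quoted on this page;
# 120 conversations is a made-up scope size.
print(estimate_credits(conversations=120, credits_per_record=2))  # 240
```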
