Agent Evaluation — Grade How Your Agents Perform

Grade how your agents actually performed on past conversations. An AI judge runs a checklist over every chat, trigger run, and sub-agent handoff and gives you a percentage-followed score, per-conversation verdicts, and a one-line reason for each call.

What Agent Evaluation is

Agent Evaluation is a built-in way to grade how your agents are actually performing — not whether they ran, but whether they did the right thing. You set up a checklist once (or have Macha build it from the agent's own instructions), and an AI judge runs that checklist over every past conversation the agent has had: chat sessions, autonomous trigger runs, sub-agent handoffs, embedded chatbot conversations. You get back a graded view of how often the agent followed its instructions, where it slipped, and a short reason for each verdict so you can audit the call without re-reading the whole conversation.

It lives under the Agent Evaluation tab on every agent's detail page, alongside Configuration, Chat, Analytics, and History.

When to use it

Three situations where Evaluation pays off:

  • You've changed the agent's prompt and want to know whether that change actually helped, hurt, or made no difference. Run the evaluation before and after — the percentage-followed score and per-conversation verdicts tell you immediately.
  • You're onboarding a new sub-agent into an existing routing flow and need to make sure the parent agent is handing off correctly. Evaluation grades the handoff messages, the routing decisions, and any forbidden actions in one pass.
  • You're rolling out an agent to a new team or use case and want a baseline quality score before going wider. Set the evaluation up once, run it weekly, and you have a quality trend line.

Evaluation is not a live moderator — it grades past conversations, not in-flight ones. For real-time guardrails, use the agent's own instructions or a confirmation gate on write tools.

How an Evaluation is shaped

Every Evaluation has four parts:

  • An agent: the one whose past conversations you want graded. The Evaluation tab lives on the agent's detail page, so the agent is preselected and you can't accidentally evaluate the wrong agent.
  • A scope: which slice of that agent's conversations to grade. Pick a date range and optionally filter by source (chat, trigger, sub-agent, embed) or by the model the agent ran on. By default, the last 7 days are graded.
  • Which instructions to grade against: agents have version history, so you choose between the instructions that were live when each conversation actually ran (the honest "did this agent do its job at the time?" view), your latest instructions (the "would today's prompt have passed?" view), or a specific saved version (for A/B comparisons between prompt iterations).
  • A grading schema: the checklist of fields the AI judge fills in for every conversation. Two fields are added for you automatically and can't be removed: Instructions followed (Yes / Partially / No) and a short Why explanation. You add the rest of the checks, for example "Right tool called", "Tone was empathetic", "Refund granted only when policy allowed".
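If it helps to picture the four parts as data, here is a minimal Python sketch. It is not Macha's actual data model: every class, field, and function name below is hypothetical, and the exact boundaries of the 7-day default are an assumption.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class InstructionsSource(Enum):
    # Hypothetical identifiers for the three choices described above.
    AS_RUN = "as_run"          # instructions live when each conversation ran
    LATEST = "latest"          # the agent's current instructions
    PINNED = "pinned_version"  # one specific saved version


@dataclass
class Scope:
    start: date
    end: date
    sources: Optional[set] = None  # e.g. {"chat", "trigger"}; None means all sources
    model: Optional[str] = None    # None means any model the agent ran on


def default_scope(today: date) -> Scope:
    """Default scope: the last 7 days (boundary handling is an assumption)."""
    return Scope(start=today - timedelta(days=7), end=today)


# The two fields every evaluation carries; they cannot be removed.
MANDATORY_FIELDS = ("Instructions followed", "Why")


@dataclass
class Evaluation:
    agent_id: str                            # preselected by the agent's tab
    scope: Scope
    instructions_source: InstructionsSource
    pinned_version: Optional[str] = None     # only meaningful for PINNED
    custom_fields: list = field(default_factory=list)

    def schema(self) -> list:
        """Mandatory fields first, then the checks you added."""
        return list(MANDATORY_FIELDS) + list(self.custom_fields)
```

The point of the sketch is the ordering in schema(): the two mandatory fields always come first, and your own checks are appended after them.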

Creating an evaluation

Open any agent and click the Agent Evaluation tab. The first time you open it, you'll land on an empty state that explains the feature and shows a primary Create your first evaluation button. After that, the tab shows a table of the evaluations you've already run.

Click New evaluation (or Create your first evaluation) and you'll see a three-step wizard:

  1. Scope: pick the date range, optionally filter by source and model, and choose which version of the instructions the judge should grade against.
  2. Extract: define the checklist. Two paths here:
    • Build with AI (recommended): Macha reads this agent's instructions and proposes a set of grading fields automatically, one per rule it finds. You can edit, remove, or add to anything it proposes before running.
    • Add manually: define each field yourself with a column label, an answer type (Yes/No, Single choice, Multiple choice, Short text, Long text, Number), and per-field guidance telling the judge how to decide. Use this when you have a very specific rubric in mind.
  3. Review: see what's about to be graded, get a cost estimate (graded conversations × the per-record credit rate of the judge model you picked), then start the run.
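To make the "Add manually" path concrete, here is a rough Python sketch of what one manually defined field carries. The answer types are the six listed above; the class names and the validation rules (non-empty label and guidance, at least two options on choice fields) are illustrative assumptions, not documented product behaviour.

```python
from dataclasses import dataclass, field
from enum import Enum


class AnswerType(Enum):
    YES_NO = "Yes/No"
    SINGLE_CHOICE = "Single choice"
    MULTIPLE_CHOICE = "Multiple choice"
    SHORT_TEXT = "Short text"
    LONG_TEXT = "Long text"
    NUMBER = "Number"


CHOICE_TYPES = {AnswerType.SINGLE_CHOICE, AnswerType.MULTIPLE_CHOICE}


@dataclass
class GradingField:
    label: str               # column label shown on the results grid
    answer_type: AnswerType
    guidance: str            # tells the judge how to decide
    options: list = field(default_factory=list)  # used by choice types only

    def problems(self) -> list:
        """Return readable problems; an empty list means the field looks usable.

        These checks are assumed for illustration only.
        """
        found = []
        if not self.label.strip():
            found.append("column label is empty")
        if not self.guidance.strip():
            found.append("judge guidance is empty")
        if self.answer_type in CHOICE_TYPES and len(self.options) < 2:
            found.append("choice fields need at least two options")
        return found


refund_check = GradingField(
    label="Refund granted only when policy allowed",
    answer_type=AnswerType.YES_NO,
    guidance="Yes if no refund was granted outside the stated policy.",
)
```

Writing the guidance as a decision rule ("Yes if ...") rather than a topic tends to give the judge less room to guess.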

The wizard runs entirely inside the agent's tab — the URL stays on the agent page the whole time, so you don't lose context if you flip away mid-flow.

The two mandatory fields

Every Agent Evaluation grades two fields automatically:

  • Instructions followed — a Yes / Partially / No verdict on whether the agent followed its instructions on this conversation. Yes means every required step was carried out correctly; Partially means it mostly followed them but made a notable mistake (wrong tag, wrong template, skipped step); No means the agent took the wrong action or didn't follow the instructions at all.
  • Why — a one-to-two sentence explanation of the verdict, pointing to a specific moment in the conversation. This is the audit trail: a reviewer can scan the percentage score, click into a row, and immediately see why the AI judge graded it that way without reading the whole conversation.

Both fields show up as the first two columns on the results grid, and the verdict drives the green/amber/red colour coding used everywhere else (the compliance pill on the table, the donut chart on the report, the headline percentage on the evaluation list).
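The verdict-to-colour mapping can be sketched in a few lines of Python. The enum and function names are hypothetical; only the Yes/Partially/No values and the green/amber/red colours come from the description above.

```python
from enum import Enum


class Verdict(Enum):
    YES = "Yes"
    PARTIALLY = "Partially"
    NO = "No"


# Colour coding driven by the Instructions followed verdict.
VERDICT_COLOUR = {
    Verdict.YES: "green",
    Verdict.PARTIALLY: "amber",
    Verdict.NO: "red",
}


def pill_colour(verdict_text: str) -> str:
    """Map a verdict label to its pill colour; unknown labels raise ValueError."""
    return VERDICT_COLOUR[Verdict(verdict_text)]
```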

Building the rest of the checklist with AI

When you pick Build with AI, Macha:

  1. Reads the current instructions for the agent you're evaluating (or the specific version you pinned).
  2. Identifies the rules and constraints in those instructions — what the agent must do, what it must not do, how it must classify things.
  3. Proposes one grading field per rule. Each field comes with a column label, the right answer type, options where relevant, and judge guidance.
  4. Streams the proposals into the wizard one at a time, so you can watch it work.

This usually produces a sensible 3–7 field checklist in under thirty seconds. You can edit any field before running, or delete it and add your own.

Running an evaluation and seeing results

Click Start run on the Review step. The run grades every conversation in your scope using the judge model you picked (defaults to GPT-5). You'll see a live "0 of N processed" counter and rows appearing in real time. Cancel mid-run if you want — credits already spent stay billed, but no new ones get charged.

Once it's done you get three views:

  • Results table — one row per conversation, one column per grading field. The Instructions followed column shows the Yes/Partially/No pill (green/amber/red); the Why column has the short reasoning; the rest of your custom fields fill in next to them. Click a row to open the full record drawer with the original conversation and every field's value.
  • Report view — click View report from the results page. You get a summary card per grading field. The Instructions followed card is rendered as a three-segment donut chart (green/amber/red) with a centre percentage and a clickable legend. Other fields show as bar lists.
  • Evaluation list — back on the agent's tab, the list view shows one row per evaluation with the headline % followed pill, the judge model, the status, the count of graded conversations, and the timestamp.

The headline percentage is computed as (Yes count + ½ × Partially count) ÷ total graded. Partial work earns partial credit, so a run that's 60% Yes and 40% Partially scores 80% followed, not 60%.
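The same calculation as a short Python sketch, in case you want to sanity-check a score by hand. The function name is hypothetical, and returning None for an empty run is an assumption; the article doesn't say what the product shows in that case.

```python
from collections import Counter


def percent_followed(verdicts):
    """Headline score: Yes counts fully, Partially counts half, No counts zero.

    `verdicts` is a list of "Yes" / "Partially" / "No" strings,
    one per graded conversation.
    """
    total = len(verdicts)
    if total == 0:
        return None  # assumption: an empty run has no score
    counts = Counter(verdicts)
    return 100.0 * (counts["Yes"] + 0.5 * counts["Partially"]) / total


# The example from the text: 60% Yes and 40% Partially.
example = ["Yes"] * 60 + ["Partially"] * 40
score = percent_followed(example)  # 80.0
```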

Re-running and iterating

From the evaluations list you can hit Run again on any past evaluation. This reopens the same wizard with the original config loaded — you can keep the scope and just re-run against the latest conversations, or change the instructions-source from "as run" to "latest" to see whether your newer prompt would have done better on the same conversations.

Common iteration loops:

  • Weekly quality check: same evaluation, same scope (the last 7 days). Watch the trend.
  • Before-and-after prompt change: run once against the old version (pin "instructions in effect at the time"), edit the agent's prompt, run again against "latest" over the same date range, then compare the two pills.
  • Regression catch: run on a fixed historical window after every prompt change to make sure you didn't break something the old prompt got right.
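For the regression-catch loop, the comparison you're doing by eye looks roughly like the Python sketch below. It assumes you have the per-conversation verdicts of two runs to hand as dictionaries keyed by a conversation ID; the IDs, the dictionary shape, and the function name are all hypothetical, and nothing here is a built-in Macha feature.

```python
# Higher rank means a better verdict.
RANK = {"No": 0, "Partially": 1, "Yes": 2}


def regressions(before, after):
    """Return IDs of conversations graded in both runs whose verdict got worse."""
    shared = before.keys() & after.keys()
    return sorted(cid for cid in shared if RANK[after[cid]] < RANK[before[cid]])


# Illustrative data only.
old_prompt_run = {"c1": "Yes", "c2": "Partially", "c3": "No"}
new_prompt_run = {"c1": "Partially", "c2": "Yes", "c3": "No", "c4": "Yes"}

worse = regressions(old_prompt_run, new_prompt_run)  # ["c1"]
```

Conversations that appear in only one run (c4 here) are ignored, which is why a fixed historical window matters: it keeps the two runs comparable.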

What you don't have to set up

A few things people often ask about — Agent Evaluation handles them automatically:

  • The agent is already locked in. You don't pick an agent on the Scope step — you're evaluating the agent you're already looking at.
  • The Instructions field on Step 2 (Extract) is hidden. The judge already grades against the agent's actual configured instructions via the Instructions followed field, so an extra "guidance for the judge" textbox would just be noise.
  • Sub-agent runs are graded as part of the parent conversation. The judge sees the full delegation tree — every tool call, every sub-agent reply — so a routing agent gets graded on whether its handoff was correct, not just on whether it called the right tool.
  • The judge sees the full conversation transcript by default. There's no "What the AI sees per record" picker on the Scope step for evaluations — instructions, transcript, tool calls, and metadata are all in scope, capped at the model's context budget.

Plan availability

Agent Evaluation is available on the Professional and Enterprise plans. Each graded conversation is billed at the judge model's per-record credit rate (e.g. 3 credits/record on GPT-5). You see the estimate before starting the run.
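As a back-of-the-envelope check on the estimate, here is a tiny Python sketch. It assumes cost scales linearly with the number of graded conversations at the judge model's per-record rate; the function name is hypothetical, and the 3 credits/record figure is the GPT-5 example quoted above.

```python
def estimate_credits(conversation_count: int, credits_per_record: int) -> int:
    """Estimated credits for a run: conversations x per-record rate (assumed linear)."""
    if conversation_count < 0 or credits_per_record < 0:
        raise ValueError("counts and rates must be non-negative")
    return conversation_count * credits_per_record


# 250 conversations in scope at 3 credits/record.
weekly_estimate = estimate_credits(250, 3)  # 750
```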

© 2026 AGZ Technologies Private Limited