Choosing the Right AI Model

Match the model to the workload. A three-question framework for deciding when to use GPT-5.4, GPT-5, GPT-5.4 Mini, Claude, or Groq — and how to validate your choice with a side-by-side test.

Why Model Choice Is The Biggest Lever

Model choice is the single largest determinant of both quality and cost on Macha. The same instructions, same tools, and same workflow will produce dramatically different results depending on which model is behind the agent. Pick wrong on the high side and you burn five times the credits for no perceptible quality gain. Pick wrong on the low side and you ship an agent that misses obvious things and slowly erodes your team's trust.

The good news: model choice is not a gut call. There are three questions that get you to the right answer almost every time, and one cheap test that confirms it.

The Models

OpenAI GPT-5.4 — Five credits per response

The most capable model on Macha. Best for agents that need to reason across many pieces of context — long ticket threads, multiple tool results, or complex routing logic where the right answer depends on subtle signals. Use this for anything customer-facing that goes out without a human review step, anything billing-related, and anything where a wrong answer is expensive.

Strengths: best-in-class reasoning, 1M context window (so it can read entire ticket histories without truncation), strong tool-use accuracy, supports image vision.

Trade-off: it costs five times as much as the Mini model per response and runs a little slower. If you are paying for GPT-5.4 quality but the workflow is just "read a ticket and add a tag," you are overspending.

OpenAI GPT-5 — Three credits per response (default)

The sweet spot for most agents. Powerful enough to handle multi-step workflows with tool use, careful enough to follow detailed instructions, cheap enough to run at scale. New agents default to GPT-5 for a reason — start here unless you have a specific reason not to.

Stays on this tier: triage agents, routing agents, internal-only research agents, and most ticket-reply agents that go through human approval. Also a reasonable default for sub-agents called from a parent.

OpenAI GPT-5.4 Mini — One credit per response

Fast and cheap. Excellent for high-volume workflows where the per-step reasoning is straightforward — categorising tickets by tag, looking up an order ID, summarising a thread, transcribing audio. Strong instruction-following for a mini model, and it supports image vision.

One caveat: mini models occasionally drop fields when calling tools with deeply nested parameter schemas. If you build a custom tool that takes a big nested config object and notice the mini model calling it with fields missing, that is the cause — switch that specific agent to GPT-5, or restructure the tool to take flat parameters, as in the sketch below.
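
To make the restructuring concrete, here is a minimal sketch in TypeScript using plain JSON-Schema tool definitions. The tool name and fields are invented for illustration and are not part of Macha's API:

  // Hypothetical tool, nested form. A mini model filling in the
  // "config" object may silently omit inner fields like "expedite".
  const updateShippingNested = {
    name: "update_shipping",
    parameters: {
      type: "object",
      properties: {
        config: {
          type: "object",
          properties: {
            street:   { type: "string" },
            city:     { type: "string" },
            expedite: { type: "boolean" },
          },
          required: ["street", "city"],
        },
      },
      required: ["config"],
    },
  };

  // Same tool, flattened: every field is a top-level parameter,
  // which mini models fill in far more reliably.
  const updateShippingFlat = {
    name: "update_shipping",
    parameters: {
      type: "object",
      properties: {
        street:   { type: "string" },
        city:     { type: "string" },
        expedite: { type: "boolean" },
      },
      required: ["street", "city"],
    },
  };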

Anthropic Claude Sonnet

Strong at nuanced writing, careful reasoning, and following long, complex instructions without skipping rules. Often the best choice for customer-facing replies where tone and empathy matter, and for agents that work from long, structured policy documents. Supports image vision.

If you have an instruction set with twelve numbered steps and several conditional branches, Claude tends to follow the structure more reliably than equivalent OpenAI models. The trade-off is slightly less consistent tool use in some edge cases.

Groq (Llama)

Extremely low latency — useful when you need a response in under a second, like a Slack bot that has to feel snappy in a real-time conversation. Trade-off: Groq models do not support image vision, so they will return a "not supported" message if a tool returns an image. Use Groq for text-only, latency-sensitive work.

The Three-Question Decision Framework

Walk through these in order. The first one that returns a hard answer settles the choice.

  1. Does this agent send anything to a customer without human review? If yes, default to GPT-5.4 or Claude Sonnet. The cost of one bad reply at scale dwarfs the credit savings of running a mini model. The internal-notes-only testing pattern (covered in Testing Best Practices) lets you defer this decision until you have proof the agent's drafts are good enough.
  2. Does this agent need to read images? If yes, exclude Groq. Any of the OpenAI models or Claude work.
  3. Is the workflow simple and high-volume? If yes — pure read tools, simple categorisation, internal-only writes — drop to GPT-5.4 Mini and cut credit spend by two-thirds relative to GPT-5, or 80% relative to GPT-5.4. The classic example is a triage agent that runs on every new ticket.

If none of those apply: stay on GPT-5. It is the default for a reason.
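
For readers who think in code, here is the same framework as a small TypeScript function. The profile fields and model identifiers are illustrative assumptions, not Macha configuration:

  type Model = "gpt-5.4" | "claude-sonnet" | "gpt-5" | "gpt-5.4-mini" | "groq-llama";

  interface AgentProfile {
    unreviewedCustomerFacing: boolean; // Q1: output reaches customers with no human review
    needsImageVision: boolean;         // Q2: must read images (excludes Groq)
    simpleHighVolume: boolean;         // Q3: straightforward steps, runs constantly
    toneSensitive?: boolean;           // nuanced writing favours Claude at the top tier
    latencyCritical?: boolean;         // sub-second responses favour Groq (text only)
  }

  function pickModel(a: AgentProfile): Model {
    // Q1 settles it: unreviewed customer output demands a top-tier model.
    if (a.unreviewedCustomerFacing) {
      return a.toneSensitive ? "claude-sonnet" : "gpt-5.4";
    }
    // Latency-sensitive, text-only work is Groq territory; Q2 excludes it otherwise.
    if (a.latencyCritical && !a.needsImageVision) return "groq-llama";
    // Q3: simple, high-volume work drops to the mini tier.
    if (a.simpleHighVolume) return "gpt-5.4-mini";
    // None applied: stay on the default.
    return "gpt-5";
  }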

Validate Your Choice With A Side-By-Side Test

The cheapest experiment in agent design is running the same agent on two models in parallel for a small batch of test cases. Do this before committing to a model in production.

Concretely: create the agent on the model you think is right. Use the Test Run feature to run it against ten varied real-world inputs (mix of easy, hard, and edge cases). Note the outputs. Now duplicate the agent, switch only the model, and run the same ten inputs. Compare side by side.

If the cheaper model's outputs are indistinguishable from the more expensive one's, you have just cut your credit spend by a factor of roughly two to five. If the cheaper one is meaningfully worse, you have proof that the upgrade is worth it. Either outcome is a win — you stop making the choice on intuition.
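
If you prefer scripting the comparison, a minimal harness looks like the sketch below. runAgent is an assumed callback into your own test endpoint, not a real Macha API; the Test Run feature does the same thing from the UI:

  type RunAgent = (agentId: string, input: string) => Promise<string>;

  // Run the same inputs through two copies of the agent that differ
  // only in model, and print the outputs side by side for review.
  async function sideBySide(run: RunAgent, agentA: string, agentB: string, inputs: string[]) {
    for (const input of inputs) {
      const [a, b] = await Promise.all([run(agentA, input), run(agentB, input)]);
      console.log(`INPUT: ${input}\n  A: ${a}\n  B: ${b}\n`);
    }
  }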

The Math Of Model Choice

To put numbers behind the trade-off: an agent that runs on every new ticket at 1,000 tickets a month consumes:

  • GPT-5.4 Mini: ~1,000 credits / month
  • GPT-5: ~3,000 credits / month
  • GPT-5.4: ~5,000 credits / month

(Approximate — exact cost depends on tool-call rounds and message length.) On the Pro plan that is the difference between staying well under your credit budget and running over. That is the lever. Use it deliberately.
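
The arithmetic is one line per tier and worth sanity-checking against your own volume. A trivial sketch, using the per-response costs quoted above:

  // Monthly credits ~ runs per month x credits per response.
  // Real usage also grows with extra tool-call rounds.
  const runsPerMonth = 1_000;
  const creditsPerResponse = { "gpt-5.4-mini": 1, "gpt-5": 3, "gpt-5.4": 5 };

  for (const [model, credits] of Object.entries(creditsPerResponse)) {
    console.log(`${model}: ~${runsPerMonth * credits} credits/month`);
  }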

Tip

Different models consume different credit amounts per response. Check the credit cost shown next to each model in the dropdown when making your selection. Enterprise plans bypass credit limits entirely, so model choice on Enterprise is purely a quality and latency decision.

Changing Models On A Live Agent

You can change an agent's model at any time. The change applies to the next conversation — in-flight conversations finish on whatever model they started on. There is no migration step, no compatibility check, no downtime. This means it is genuinely safe to A/B model choice on a live agent: pick a small subset (one brand, one tag), switch the model, and watch.

If quality drops, switch back. If credits drop with no quality loss, keep the new choice and roll out to wider traffic.
