Why Context Window Size Matters for AI Support Agents
Every AI model has a context window — the maximum amount of information it can process in a single request. When your AI agent reads a ticket, searches your knowledge base, calls tools, and drafts a response, all of that data lives inside the context window. When it fills up, the model starts forgetting earlier information. Here's why this matters for customer support and how to think about it.
What is a context window?
A context window is the total amount of text (measured in tokens — roughly 4 characters per token) that an AI model can process in one request. Everything the model reads and generates must fit inside this window:
- The system prompt (agent instructions, tool definitions, knowledge context)
- The conversation history (all previous messages in this interaction)
- Tool results (data returned from Zendesk, APIs, knowledge base searches)
- The model's own response
If the total exceeds the context window, the request either fails with an error or older information is dropped, depending on the model and client.
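To make the arithmetic concrete, here's a minimal budgeting sketch in Python. It assumes the rough 4-characters-per-token heuristic above; the `CONTEXT_WINDOW`, `RESPONSE_RESERVE`, and `fits_in_window` names are illustrative, and a real tokenizer (such as OpenAI's tiktoken) would give exact counts.

```python
# Rough token budgeting for a single agent request.
# Uses the ~4-characters-per-token heuristic; a real tokenizer
# (e.g. tiktoken) would give exact counts.

CONTEXT_WINDOW = 128_000   # e.g. a 128K model
RESPONSE_RESERVE = 4_000   # leave room for the model's own reply

def estimate_tokens(text: str) -> int:
    """Approximate token count: ~4 characters per token."""
    return len(text) // 4

def fits_in_window(system_prompt: str,
                   messages: list[str],
                   tool_results: list[str]) -> bool:
    """Check whether everything, plus the reply, fits in the window."""
    used = estimate_tokens(system_prompt)
    used += sum(estimate_tokens(m) for m in messages)
    used += sum(estimate_tokens(r) for r in tool_results)
    return used + RESPONSE_RESERVE <= CONTEXT_WINDOW
```

A request that fails this kind of check is a candidate for truncation or compaction, which is what Macha handles automatically (more on that below).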
How different models compare
| Model | Context Window | In practice |
|---|---|---|
| GPT-5.4 Mini | 400,000 tokens | ~300 pages of text |
| GPT-5 Mini | 128,000 tokens | ~100 pages of text |
| Claude Sonnet 4.5 | 200,000 tokens | ~150 pages of text |
| Llama 3.3 70B | 128,000 tokens | ~100 pages of text |
When context window matters in support
Long ticket histories
A customer who's been going back and forth for 15 messages generates a lot of conversation context. Each message — plus the agent's replies, tool calls, and tool results — adds to the token count. On a 128K model, a very long ticket conversation can approach the limit.
Large tool results
When your agent fetches a ticket with a long description, searches a knowledge base, and reads a multi-page document, each tool result adds thousands of tokens. An agent that chains 5-6 tool calls in one run can consume 20,000-40,000 tokens just from tool results.
Rich agent instructions
Detailed agent instructions (WISMO classification rules, response templates, business logic) can be 2,000-5,000 tokens. Add tool definitions for 10+ tools and you're at 8,000-10,000 tokens before the first message.
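Putting those three sources together, here's an illustrative breakdown of one agent run. Every figure is a hypothetical midpoint of the ranges quoted above, not a measurement:

```python
# Illustrative token budget for one autonomous run.
# All numbers are rough, hypothetical estimates drawn from the
# ranges quoted in the sections above.
budget = {
    "agent instructions":             3_500,   # rules, templates, logic
    "tool definitions (10+ tools)":   5_500,
    "conversation history (15 msgs)": 12_000,
    "tool results (5-6 calls)":       30_000,
}
total = sum(budget.values())
for item, tokens in budget.items():
    print(f"{item:32} {tokens:>7,} tokens")
print(f"{'total':32} {total:>7,} tokens "
      f"({total / 128_000:.0%} of a 128K window)")
```

At roughly 51,000 tokens, this run is still comfortably inside a 128K window, but a few more tool calls or a longer ticket history starts to change that.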
What happens when the window fills up
Macha automatically manages context window limits using conversation compaction. When the estimated token count crosses 55% of the model's context window, the middle portion of the conversation is summarized — keeping the first message (original intent) and recent messages (current context) intact. The summary is generated by a fast model and replaces the middle section, freeing up space for the conversation to continue.
This is transparent to the user — the conversation continues naturally. But the summarized middle section loses some detail. If the agent needed to reference a specific number or ID from 10 messages ago, that detail may be gone after compaction.
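As a sketch, the strategy looks roughly like this. This is not Macha's actual implementation: the 55% threshold and the first-plus-recent split come from the description above, while `KEEP_RECENT` and `summarize()` are illustrative stand-ins.

```python
# Minimal sketch of conversation compaction as described above.
# summarize() stands in for a call to a fast summarization model.

COMPACTION_THRESHOLD = 0.55  # compact once usage crosses 55% of the window
KEEP_RECENT = 5              # recent messages kept verbatim (illustrative)

def estimate_tokens(text: str) -> int:
    return len(text) // 4    # same rough heuristic as earlier

def summarize(messages: list[str]) -> str:
    """Placeholder: in practice, a fast model generates this summary."""
    return f"[summary of {len(messages)} earlier messages]"

def maybe_compact(messages: list[str], context_window: int) -> list[str]:
    used = sum(estimate_tokens(m) for m in messages)
    if used < COMPACTION_THRESHOLD * context_window:
        return messages                   # under the threshold; keep everything
    first = messages[0]                   # original intent
    recent = messages[-KEEP_RECENT:]      # current context
    middle = messages[1:-KEEP_RECENT]     # candidate for summarization
    if not middle:
        return messages                   # too short to compact
    return [first, summarize(middle)] + recent
```

Keeping the first message preserves the customer's original intent, and keeping the most recent messages preserves the live thread; only the middle, where detail matters least, is collapsed into a summary.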
The practical takeaway
For most support agents, GPT-5 Mini's 128K context window is more than sufficient. A typical autonomous run (read ticket + check order + draft response) uses 15,000-25,000 tokens — well below the limit. The 400K window of GPT-5.4 Mini becomes valuable when you have agents that handle very long conversations, chain many tool calls, or operate on tickets with extensive history.
Choose your model based on task complexity and conversation length, not just raw context size. A bigger window helps only when you actually fill it.