Data Sources

What Are Data Sources

Data sources give your agents access to knowledge — documents, files, websites, and live data from connected services. While connectors let agents take actions in external tools, data sources let agents reference information to provide better, more accurate responses.

For example, you might upload your company's return policy as a PDF, connect a Google Doc with your product FAQ, or index your website's help pages. When a customer asks a question, the agent can look up the answer in these sources rather than relying solely on its training data.

File Uploads

You can upload files directly to Macha as a data source. Supported formats include:

  • PDF — Product manuals, policy documents, contracts, reports.
  • CSV — Customer lists, product catalogs, pricing tables.
  • XLSX — Spreadsheets with structured data.
  • DOCX — Word documents with procedures, guides, or reference material.
  • TXT — Plain text files.

How Files Are Processed

Macha processes uploaded files differently depending on their size and type:

Small Spreadsheets (2,000 rows or fewer)

CSV and XLSX files with 2,000 rows or fewer are injected directly into the agent's context. This means the entire file content is available to the agent at all times — no searching required. The agent can reference any row or column immediately. This is the fastest and most reliable way for an agent to work with structured data.

Large Files

Everything else is processed differently: spreadsheets with more than 2,000 rows, PDFs, DOCX files, and other documents. Macha chunks the content into smaller segments and creates embeddings (vector representations) for semantic search. The agent then searches this knowledge base with the search_knowledge tool and retrieves specific document sections with the get_document tool.

For spreadsheets, chunking is row-aware: each chunk contains approximately 75 rows with the header row preserved, so the agent always knows what each column represents.
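To make row-aware chunking concrete, here is a minimal Python sketch. It is not Macha's implementation: the function name chunk_rows and the fixed chunk size of 75 are assumptions for illustration (the text above says only "approximately 75 rows"). It shows the one property that matters, which is that every chunk starts with the header row.

```python
from typing import List

Row = List[str]

# Approximate chunk size taken from the text above; the real value may vary.
ROWS_PER_CHUNK = 75


def chunk_rows(rows: List[Row], rows_per_chunk: int = ROWS_PER_CHUNK) -> List[List[Row]]:
    """Split a table into chunks, repeating the header row at the top of each chunk.

    Hypothetical helper: the first row is treated as the header.
    """
    if not rows:
        return []
    header, body = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(body), rows_per_chunk):
        # Each chunk is the header plus the next slice of data rows.
        chunks.append([header] + body[start:start + rows_per_chunk])
    return chunks


# Made-up table: one header row plus 2,110 data rows (too large to be injected).
table = [["sku", "price"]] + [[f"SKU-{i}", str(i)] for i in range(2110)]
chunks = chunk_rows(table)

print(len(chunks))         # 29 (28 full chunks of 75 rows, plus one with 10 rows)
print(chunks[-1][0])       # ['sku', 'price']
print(len(chunks[-1]) - 1) # 10 data rows in the last chunk
```

Because the header travels with every chunk, a search hit on any chunk still tells the agent which column is which.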

Tip

If your spreadsheet has 2,000 rows or fewer, it is injected directly and the agent always has full access to the data. For larger datasets, consider splitting the file into smaller, focused spreadsheets that each stay within that limit.

Website Sources

You can add a website URL as a data source. Macha will crawl and index the site's pages, making the content searchable by your agents. This is useful for documentation sites, help centers, knowledge bases, and public-facing content that you want agents to be able to reference.

Website sources are processed the same way as large files — the content is chunked, embedded, and made available through the search_knowledge and get_document tools.

Connector Sources

In addition to static file uploads, you can connect live data from Google Docs and Notion. These sources stay up to date because the agent reads the content directly from the service every time it needs it, rather than working from a static copy.

Google Docs

When you add Google Docs as a data source, the agent reads documents live using the google_read_doc tool. Every time the agent references a document, it fetches the current version — so if someone updates the Google Doc, the agent immediately sees the changes.

Notion Pages

Notion works the same way. The agent reads pages live using the notion_get_page tool, always getting the most current content.

Auto-Linking Tools and Connectors

When you add a connector-based source (Google or Notion) to an agent's data sources, Macha automatically adds the required read tools and connector instance to the agent. You do not need to manually assign the Google or Notion connector — it happens for you.

For Google sources, Macha auto-adds: google_read_doc, google_read_sheet, and google_list_drive_files.

For Notion sources, Macha auto-adds: notion_search and notion_get_page.

These auto-added tools are locked — you cannot remove them while the data source is linked. If you want to remove the tools, remove the data source first.
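The auto-linking and locking rules can be modeled in a few lines of Python. This is only a sketch: the tool names come from the lists above, but the Agent class, its methods, and the connector keys "google" and "notion" are hypothetical and do not describe Macha's internal data model. The sketch also assumes that tools stay on the agent after the source is removed and simply become removable; the text above does not say whether Macha removes them automatically.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

# Tool names are from the documentation; the mapping structure is an assumption.
AUTO_TOOLS: Dict[str, List[str]] = {
    "google": ["google_read_doc", "google_read_sheet", "google_list_drive_files"],
    "notion": ["notion_search", "notion_get_page"],
}


@dataclass
class Agent:
    """Hypothetical agent model used only to illustrate the locking rule."""
    tools: Set[str] = field(default_factory=set)
    linked_sources: Set[str] = field(default_factory=set)

    def add_source(self, connector: str) -> None:
        # Linking a connector source auto-adds its read tools.
        self.linked_sources.add(connector)
        self.tools.update(AUTO_TOOLS[connector])

    def locked_tools(self) -> Set[str]:
        # A tool is locked while any linked source requires it.
        return {t for c in self.linked_sources for t in AUTO_TOOLS[c]}

    def remove_tool(self, tool: str) -> None:
        if tool in self.locked_tools():
            raise ValueError(f"{tool} is locked; remove the data source first")
        self.tools.discard(tool)

    def remove_source(self, connector: str) -> None:
        # Assumption: tools remain on the agent but are no longer locked.
        self.linked_sources.discard(connector)


agent = Agent()
agent.add_source("notion")
print(sorted(agent.tools))           # ['notion_get_page', 'notion_search']

try:
    agent.remove_tool("notion_search")
except ValueError as err:
    print(err)                       # notion_search is locked; remove the data source first

agent.remove_source("notion")
agent.remove_tool("notion_search")   # allowed once the source is unlinked
print(sorted(agent.tools))           # ['notion_get_page']
```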

Tip

Connector sources are ideal for documents that change frequently. Instead of re-uploading files every time they are updated, let the agent read the live version from Google Docs or Notion.

Scope Filtering

When you add a data source to an agent, you can control which documents the agent has access to:

  • All documents — The agent can access every document in the source. As new documents are added to the source, the agent automatically gains access to them.
  • Selected documents — The agent can only access the specific documents you choose. This is useful when a source contains many documents but the agent only needs a subset.

Scope filtering works across all source types — file uploads, websites, Google Docs, and Notion pages.
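The difference between the two scopes shows up when a source grows. The Python sketch below is illustrative only: visible_documents is a hypothetical helper, the file names are made up, and using None to mean "All documents" is a modeling choice, not Macha's API.

```python
from typing import Iterable, List, Optional, Set


def visible_documents(all_doc_ids: Iterable[str],
                      selected: Optional[Set[str]] = None) -> List[str]:
    """Return the documents an agent can access under a given scope.

    selected=None models "All documents"; a set models "Selected documents".
    """
    if selected is None:
        return list(all_doc_ids)
    return [doc_id for doc_id in all_doc_ids if doc_id in selected]


# Made-up source contents.
source_docs = ["returns.pdf", "faq.docx"]
selected = {"faq.docx"}

# A new document is added to the source later.
source_docs.append("warranty.pdf")

print(visible_documents(source_docs))            # ['returns.pdf', 'faq.docx', 'warranty.pdf']
print(visible_documents(source_docs, selected))  # ['faq.docx']
```

With "All documents", the new file is visible without any configuration change; with "Selected documents", it stays hidden until you add it to the selection.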

How Agents Access Knowledge

Understanding how your agent reads data sources helps you configure them effectively. There are three modes of access:

Injected Documents

Small CSV and XLSX files (2,000 rows or fewer) are injected directly into the agent's system prompt. The agent always has this data available — no tool calls needed. This is the fastest and most reliable access mode, but it uses context window space.

Searchable Documents

Large spreadsheets (more than 2,000 rows), PDFs, DOCX files, other documents, and website content are stored as searchable knowledge. When the agent needs information from these sources, it:

  1. Calls the search_knowledge tool with a query to find relevant document chunks.
  2. Reviews the search results to identify the most relevant sections.
  3. Calls the get_document tool to retrieve the full content of specific chunks.

This search-then-retrieve process lets agents work with very large knowledge bases efficiently, because they load only the parts they need.
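The steps above can be sketched as a toy, in-memory Python example. Everything here is a stand-in: the two functions borrow the tool names search_knowledge and get_document, but their signatures are assumptions, the chunk contents are made up, and simple word overlap replaces the embedding-based semantic search that Macha actually uses.

```python
from typing import Dict, List

# Toy knowledge base: chunk ID -> chunk text (made-up content).
CHUNKS: Dict[str, str] = {
    "returns-1": "Items can be returned within 30 days of delivery.",
    "returns-2": "Refunds are issued to the original payment method.",
    "shipping-1": "Standard shipping takes 3 to 5 business days.",
}


def search_knowledge(query: str, top_k: int = 2) -> List[str]:
    """Step 1: return IDs of the chunks that share the most words with the query.

    Word overlap is a stand-in for semantic (embedding) search.
    """
    words = set(query.lower().split())
    scored = []
    for chunk_id, text in CHUNKS.items():
        overlap = len(words & set(text.lower().rstrip(".").split()))
        if overlap:
            scored.append((overlap, chunk_id))
    # Highest overlap first; ties broken by chunk ID for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [chunk_id for _, chunk_id in scored[:top_k]]


def get_document(chunk_id: str) -> str:
    """Step 3: load the full content of one chunk."""
    return CHUNKS[chunk_id]


hits = search_knowledge("how many days for shipping")
print(hits)                   # ['shipping-1', 'returns-1']

# Step 2: pick the most relevant hit, then load only that chunk.
print(get_document(hits[0]))  # Standard shipping takes 3 to 5 business days.
```

Only the chunks that match the query are ever loaded, which is why this mode scales to knowledge bases far larger than the context window.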

Live Connector Documents

Google Docs and Notion pages are read live using their respective tools (google_read_doc or notion_get_page). The agent calls the tool with the document ID and gets back the current content. This ensures the agent always works with the latest version of the document.

Tip

You can mix all three types in a single agent. For example, inject a small product pricing spreadsheet, add a large policy PDF as searchable knowledge, and connect a live Google Doc with your latest FAQ — all on the same agent.
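As a quick summary of the rules in this article, the Python sketch below maps a source to its access mode. The function access_mode and the source type labels ("csv", "google_doc", and so on) are hypothetical names chosen for illustration. Treating a spreadsheet with an unknown row count as searchable is also an assumption.

```python
from typing import Optional

# Injection threshold stated in this article.
INJECTION_ROW_LIMIT = 2000


def access_mode(source_type: str, row_count: Optional[int] = None) -> str:
    """Return 'injected', 'searchable', or 'live' for a data source.

    source_type labels are hypothetical, not Macha identifiers.
    """
    kind = source_type.lower()
    if kind in {"google_doc", "notion_page"}:
        return "live"        # read from the service on every access
    if kind in {"csv", "xlsx"} and row_count is not None and row_count <= INJECTION_ROW_LIMIT:
        return "injected"    # whole file placed in the agent's context
    return "searchable"      # chunked, embedded, and searched on demand


# The three sources from the tip above, with a made-up row count.
print(access_mode("xlsx", row_count=350))  # injected
print(access_mode("pdf"))                  # searchable
print(access_mode("google_doc"))           # live
```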

© 2026 AGZ Technologies Private Limited