Knowledge Base Chunking Is Product Design

If your agent retrieves the wrong context, the model may still write a beautiful answer. That is the problem.

Knowledge Team · January 30, 2026 · 10 min read

Retrieval-augmented generation often gets discussed as infrastructure: embeddings, indexes, vector stores, rerankers, and latency. Those pieces matter. But many retrieval failures start earlier, at the moment a messy document is chopped into chunks that no longer resemble how a human would use the source.

A policy is not just words. It has sections, exceptions, dates, owners, and dependencies. A troubleshooting guide has steps. A contract has clauses. A sales playbook has examples and caveats. If chunking destroys that structure, the agent may retrieve a fragment that looks relevant but lacks the condition that makes it true.

[Figure: three example chunks (Billing policy, Refund exceptions, Account tiers), each source tagged and marked "updated Apr 2026". Retrieval improves when chunks preserve the shape of the source: policy, exception, owner, and date.]
Caption: Useful retrieval chunks preserve source, topic, freshness, and the boundary where meaning changes.

Chunk by meaning before size

There is no universal chunk size that fixes retrieval. A short FAQ answer may be a complete chunk. A refund policy may need the rule, exception, and approval threshold together. A procedure may need one chunk per step, with the prerequisite attached. The right boundary is the smallest unit that can answer a user intent without losing the source logic.

Start with semantic boundaries: headings, clauses, checklist items, procedure steps, table rows, and examples. Then check whether each chunk can stand on its own. If a chunk begins with "this exception" and the exception is no longer attached to the rule, the chunk is too small or missing metadata.
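
The boundary-first approach above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: it assumes the source uses Markdown-style headings, and the names `Chunk`, `chunk_by_headings`, and `needs_context` are hypothetical. The dangling-reference check is a rough heuristic for spotting chunks that open with "this exception" and similar phrases.

```python
import re
from dataclasses import dataclass

# Assumption: headings are Markdown-style lines such as "# Title" or "## Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
# Assumption: a chunk that opens with one of these words probably refers
# to something that lives outside the chunk.
DANGLING = re.compile(r"^(this|that|these|those|it|such)\b", re.IGNORECASE)


@dataclass
class Chunk:
    heading_path: list[str]  # headings above the chunk, outermost first
    text: str


def chunk_by_headings(doc: str) -> list[Chunk]:
    """Split a document at headings and keep the heading trail as context."""
    chunks: list[Chunk] = []
    path: list[str] = []
    buffer: list[str] = []

    def flush() -> None:
        body = "\n".join(buffer).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buffer.clear()

    for line in doc.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at this level or deeper
            path.append(match.group(2).strip())
        else:
            buffer.append(line)
    flush()
    return chunks


def needs_context(chunk: Chunk) -> bool:
    """Flag chunks whose first word points at text outside the chunk."""
    return bool(DANGLING.match(chunk.text))
```

A chunk flagged by `needs_context` is a candidate for merging with its parent section or for carrying the heading path as metadata, so the rule and its exception travel together.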

Overlap is a tool, not a ritual

Overlap helps when meaning crosses a boundary. It hurts when it floods retrieval with near-duplicates. Too much overlap can make the top results look consistent while hiding the one chunk that contains the actual answer. Use overlap where continuity matters: long procedures, multi-part clauses, or definitions that are referenced across nearby sections.
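
As a rough sketch of both sides of that trade-off: the first function below builds overlapping windows over pre-split units (sentences or procedure steps), and the second drops near-duplicate results so repeated overlap does not crowd out the chunk with the actual answer. The function names, the word-level Jaccard measure, and the 0.8 threshold are all illustrative assumptions, not tuned recommendations.

```python
def window_with_overlap(units: list[str], size: int, overlap: int) -> list[list[str]]:
    """Group units into windows of `size`, sharing `overlap` units between neighbors."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(units), step):
        windows.append(units[start:start + size])
        if start + size >= len(units):
            break  # the last window already reaches the end
    return windows


def jaccard(a: str, b: str) -> float:
    """Word-set overlap between two texts, from 0.0 to 1.0."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def drop_near_duplicates(ranked: list[str], threshold: float = 0.8) -> list[str]:
    """Keep results in rank order, skipping any too similar to one already kept."""
    kept: list[str] = []
    for text in ranked:
        if all(jaccard(text, other) < threshold for other in kept):
            kept.append(text)
    return kept
```

Setting `overlap=0` for self-contained content such as FAQ answers, and a small positive value for long procedures or multi-part clauses, keeps overlap tied to where continuity matters.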

Metadata makes retrieval controllable

The most useful metadata is boring: source name, owner, version, published date, product area, customer segment, region, and content type. Those fields let the workflow filter before retrieval or rerank after retrieval. They also help a reviewer understand why an answer was grounded in a given source.

  • Source metadata tells the agent where the answer came from.
  • Version metadata prevents old policies from winning over current ones.
  • Domain metadata keeps legal, billing, support, and product knowledge from blending together.
  • Owner metadata gives teams a path to fix bad source material.
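
One way to picture the filter-before-retrieval step is the sketch below. It assumes each chunk carries a small metadata record; the `ChunkRecord` fields and the `filter_candidates` function are hypothetical names chosen for illustration, and integer versions are an assumption about how sources are versioned.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class ChunkRecord:
    text: str
    source: str      # where the answer came from
    owner: str       # who can fix the source material
    version: int     # assumption: versions are comparable integers
    published: date
    domain: str      # e.g. "billing", "legal", "support"
    region: str


def filter_candidates(
    records: list[ChunkRecord], domain: str, region: str
) -> list[ChunkRecord]:
    """Keep in-scope chunks, and only the newest version of each source."""
    in_scope = [r for r in records if r.domain == domain and r.region == region]
    latest: dict[str, int] = {}
    for record in in_scope:
        if record.version > latest.get(record.source, -1):
            latest[record.source] = record.version
    return [r for r in in_scope if r.version == latest[r.source]]
```

Running the filter before similarity search means an outdated policy or a chunk from the wrong domain never competes for a top slot, and the `owner` field on whatever is returned tells a reviewer whom to contact.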

Evaluate retrieval separately from generation

If an answer is bad, teams often blame the model. First ask whether the right chunks were retrieved. A retrieval eval should test whether the top results include the source a human would use. Only after retrieval is healthy should generation quality become the main question.
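
A retrieval eval of this kind can be very small. The sketch below computes a hit rate at k: the share of test queries whose top k results include at least one source a human reviewer marked as correct. The function names and the shape of the test cases are assumptions, and `retrieve` stands in for whatever retriever the workflow uses, provided it returns source identifiers in rank order.

```python
from typing import Callable, Iterable


def hit_at_k(retrieved: list[str], expected: set[str], k: int) -> bool:
    """True if any of the top-k retrieved sources is an expected source."""
    return any(source in expected for source in retrieved[:k])


def retrieval_hit_rate(
    cases: Iterable[tuple[str, set[str]]],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> float:
    """Fraction of (query, expected sources) cases with a hit in the top k."""
    case_list = list(cases)
    if not case_list:
        return 0.0
    hits = sum(hit_at_k(retrieve(query), expected, k) for query, expected in case_list)
    return hits / len(case_list)
```

Because the metric never looks at generated text, a low score points at chunking, metadata, or ranking rather than at the model.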

In Trumpets, knowledge sources are part of the agent operating system: they connect to workflows, prompts, benchmarks, and review gates. That means chunking is not a back-office ingestion detail. It is product design for how an agent knows what it is allowed to say.
