Chunking
The process of splitting documents into smaller pieces (chunks) before creating embeddings. Since embedding models have token limits and work best on focused content, large documents need to be divided into manageable segments. How you chunk determines what your retrieval system can find, and, critically, what it misses.
Common strategies:
- Fixed-size chunking splits text into segments of a set token count (e.g., 512 tokens each). Simple but often splits mid-sentence or mid-paragraph. The blunt instrument approach.
- Recursive chunking tries to split at natural boundaries (paragraph breaks, then sentence breaks, then word breaks), falling back to smaller units only when needed. More surgical.
- Semantic chunking uses the content's meaning to determine split points, keeping conceptually related text together. Typically the highest-quality results, and the hardest to implement.
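To make the recursive strategy concrete, here is a minimal sketch. It measures size in characters for simplicity (a production chunker would count tokens with the embedding model's tokenizer), and the `SEPARATORS` list and `max_len` default are assumptions chosen for illustration, not a reference implementation:

```python
# Recursive chunking sketch: try paragraph breaks first, then
# sentence breaks, then word breaks, descending to a finer
# separator only when a piece still exceeds the size limit.

SEPARATORS = ["\n\n", ". ", " "]  # paragraph, sentence, word boundaries

def recursive_chunk(text: str, max_len: int = 200, seps=None) -> list[str]:
    seps = SEPARATORS if seps is None else seps
    # Base case: text fits, or no finer separators remain
    # (a single over-long token is returned whole rather than cut).
    if len(text) <= max_len or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= max_len:
            current = candidate       # greedily pack pieces into one chunk
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_len:
                # This piece alone is too big: recurse with a finer separator.
                chunks.extend(recursive_chunk(piece, max_len, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Note the trade-off the greedy packing makes: chunks stay under the limit and respect the coarsest boundary possible, but a sentence that straddles a chunk boundary still ends up split from its neighbor, which is exactly the retrieval hazard described below.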
Why it matters for writers: Chunking strategies are a frequent source of retrieval failures. A fact that spans two chunks may not be retrievable by either one. Cross-references between sections get severed. Document structure (headings, lists, tables) can be destroyed like a book run through a shredder. Writers who understand chunking can structure their content to survive it: self-contained sections, repeated key context, critical information placed where it's least likely to get split. This is one of the problems FractalRecall addresses with its metadata-aware approach.
Related terms: Embedding · Retrieval-Augmented Generation · Metadata Filtering