Module 1: How LLMs Actually Chunk Content

Module 1 of 9

What This Teaches

Why "sections" are not real chunks.

This module explains how LLMs actually break content into chunks, and why visual structure does not determine chunk boundaries.

Core Concepts

  • Token-based chunking: LLMs chunk based on token limits, not visual structure
  • DOM boundary heuristics: HTML structure influences but does not control chunking
  • Sentence vs semantic chunking: Chunks may split mid-sentence or combine multiple sentences
  • Overlap windows: Chunks overlap to preserve context, but overlap is not guaranteed
  • Truncation bias: overly long sentences or paragraphs can be cut at a chunk boundary, losing information
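
The first two concepts above can be sketched in a few lines. This is a minimal illustration, not a production chunker, and it assumes whitespace-separated words as a rough stand-in for tokens; real tokenizers (BPE, SentencePiece) split text differently.

```python
# Minimal sketch of token-based chunking with an overlap window.
# Assumption: whitespace-separated words approximate tokens.

def chunk_tokens(text, max_tokens=512, overlap=64):
    """Split text into windows of at most max_tokens words,
    each window overlapping the previous one by `overlap` words."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

text = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_tokens(text, max_tokens=512, overlap=64)
print(len(chunks))             # 3 windows for 1200 words
print(len(chunks[0].split()))  # 512 words in the first window
```

Note that the overlap is a property of this particular chunker's configuration, which is the point of the fourth bullet: a different pipeline may use a different overlap, or none at all.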

Key Truth

LLMs do not respect:

  • paragraphs
  • headings
  • visual sections

They chunk based on:

  • token limits
  • punctuation density
  • semantic similarity
  • transformer attention patterns
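
A quick way to see this in practice: feed structured text to a fixed-budget chunker and watch where the cut lands. The sketch below again uses word count as a stand-in for token count (an assumption), with a tiny budget so the boundary is visible.

```python
# Sketch: a fixed token budget cuts wherever the count runs out,
# regardless of headings or paragraph breaks.
# Assumption: word count approximates token count.

doc = (
    "Intro Heading\n"
    "First paragraph sentence one. First paragraph sentence two.\n\n"
    "Second Heading\n"
    "Second paragraph sentence one. Second paragraph sentence two."
)

words = doc.split()  # flattening discards all visual structure
budget = 8           # tiny budget so the cuts are easy to see
chunks = [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

# The first cut lands mid-sentence; the second chunk straddles a heading.
for c in chunks:
    print(repr(c))
```

Nothing about the heading or the blank line survives the flattening step, which is exactly why visual sections are not real chunks.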

Practical Rule

If a fact requires previous text, it is already broken.

Design every chunk to be self-contained. Never assume surrounding context will be preserved.
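
One cheap way to operationalize this rule is to flag chunks that open with a dangling referent. The word lists below are illustrative assumptions, not a complete catalogue; a real check would need more linguistic care.

```python
# Heuristic sketch: flag chunks whose opening words refer back to text
# the chunk does not contain. The referent lists are illustrative
# assumptions, not a complete catalogue.

DANGLING_FIRST_WORDS = {"this", "that", "these", "those", "it", "they", "however"}
DANGLING_PHRASES = ("as mentioned", "as noted", "the above")

def needs_prior_context(chunk: str) -> bool:
    """Return True if the chunk likely depends on preceding text."""
    head = chunk.strip().lower()
    first_word = head.split()[0] if head.split() else ""
    return first_word in DANGLING_FIRST_WORDS or head.startswith(DANGLING_PHRASES)

print(needs_prior_context("This approach fails under load."))              # True
print(needs_prior_context("Caching reduces latency by reusing results."))  # False
```

A chunk that trips this check still needs a human read, but it is a fast first filter for "requires previous text".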

Optional Operator Task

Task: Select any existing webpage (yours or a competitor's). Copy three consecutive paragraphs. Split them into extraction-sized chunks as an LLM would, ignoring visual structure.

Constraint: Each chunk must be token-limited (approximately 512 tokens). Split at semantic boundaries, not sentence boundaries.

What success looks like: You produce a list of chunks where each chunk can be read in isolation without losing meaning. If a chunk requires previous context to be understood, you've identified a chunking failure.
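
The task above can be approximated mechanically. True semantic boundary detection needs an embedding model, which is out of scope here; this sketch packs whole sentences greedily under the token budget as a starting point, using words as a token proxy and a crude regex as the sentence splitter (both assumptions).

```python
# Sketch of the operator task: pack sentences greedily into chunks of
# at most max_tokens, never splitting mid-sentence.
# Assumptions: words approximate tokens; the regex is a crude sentence
# splitter, not a semantic segmenter.
import re

def pack_sentences(text, max_tokens=512):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Run your three copied paragraphs through this, then read each chunk in isolation: any chunk that only makes sense after the previous one is the chunking failure the task asks you to find.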

This task is optional. No submission required. No validation. Use it to convert theory into applied thinking.