Module 1: How LLMs Actually Chunk Content
Module 1 of 9
What This Teaches
Why "sections" are not real chunks.
This module explains how LLMs actually break content into chunks, and why visual structure does not determine chunk boundaries.
Core Concepts
- Token-based chunking: LLMs chunk based on token limits, not visual structure
- DOM boundary heuristics: HTML structure influences but does not control chunking
- Sentence vs semantic chunking: Chunks may split mid-sentence or combine multiple sentences
- Overlap windows: Chunks may overlap to preserve context, but overlap is not guaranteed
- Truncation bias: Long sentences or paragraphs can be cut off mid-thought, losing information
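The first and fourth concepts above can be sketched in a few lines. This is a minimal illustration, not a production pipeline: real systems use subword tokenizers (such as BPE), so whitespace splitting here is a stand-in, and the `max_tokens` and `overlap` values are illustrative defaults.

```python
# Sketch of token-limited chunking with an overlap window.
# Whitespace "tokens" stand in for a real subword tokenizer.

def chunk_tokens(text, max_tokens=512, overlap=64):
    tokens = text.split()          # stand-in for real tokenization
    step = max_tokens - overlap    # how far each chunk advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        if start + max_tokens >= len(tokens):
            break                  # last chunk reached the end
    return chunks
```

Note that the chunk boundaries fall wherever the token count dictates: nothing in this loop looks at paragraphs, headings, or sentences.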
Key Truth
LLMs do not respect:
- paragraphs
- headings
- visual sections
They chunk based on:
- token limits
- punctuation density
- semantic similarity
- transformer attention patterns
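A quick way to see the first point is to chunk a document that has a clear paragraph break and observe that no chunk boundary lands on it. The document and the 8-token limit below are made up for the demonstration.

```python
# Demonstrates that token-limited chunking ignores visual structure:
# the paragraph break below does not align with any chunk boundary.

doc = (
    "Heading\n\n"
    "First paragraph about topic A with several words.\n\n"
    "Second paragraph about topic B with several more words."
)

def naive_chunks(text, max_tokens=8):
    tokens = text.split()  # newlines and blank lines vanish here
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]

for c in naive_chunks(doc):
    print(repr(c))
```

Running this, the middle chunk contains the end of the first paragraph and the start of the second: the visual break simply disappears.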
Practical Rule
If a fact requires previous text to be understood, it is already broken the moment chunking separates them.
Design every chunk to be self-contained. Never assume surrounding context will be preserved.
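One rough way to audit the rule above is to flag chunks that open with a referring expression, since those almost always depend on earlier text. The word list below is illustrative, not exhaustive, and this is a heuristic sketch rather than a reliable detector.

```python
# Heuristic check for the self-containment rule: a chunk that opens
# with a referring word ("this", "however", ...) likely depends on
# text outside the chunk. Word list is illustrative only.

DANGLING_OPENERS = {
    "this", "that", "these", "those", "it", "they",
    "however", "therefore", "additionally", "also",
}

def needs_prior_context(chunk):
    words = chunk.split()
    if not words:
        return False
    first = words[0].strip(".,;:").lower()
    return first in DANGLING_OPENERS
```

A chunk that fails this check is a candidate for rewriting so it names its subject explicitly instead of pointing backward.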
Optional Operator Task
Task: Select any existing webpage (yours or a competitor's). Copy three consecutive paragraphs. Split them into extraction-sized chunks as an LLM would, ignoring visual structure.
Constraint: Each chunk must be token-limited (approximately 512 tokens). Split at semantic boundaries, not sentence boundaries.
What success looks like: You produce a list of chunks where each chunk can be read in isolation without losing meaning. If a chunk requires previous context to be understood, you've identified a chunking failure.
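One way to attempt the task is to pack whole sentences into chunks under a token budget, so no chunk starts mid-sentence. The 512-token budget comes from the constraint above; the whitespace token count and the regex sentence splitter are simplifying assumptions, and finding true semantic boundaries still requires your own judgment on top of this.

```python
import re

# Pack whole sentences into chunks that stay under a token budget.
# Whitespace word count approximates tokens; the regex is a crude
# sentence splitter that breaks after ., !, or ?.

def pack_sentences(text, budget=512):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > budget:
            chunks.append(" ".join(current))  # flush the full chunk
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Reading each resulting chunk in isolation, as the success criterion describes, then shows you which ones still lean on missing context.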
This task is optional. No submission required. No validation. Use it to convert theory into applied thinking.