Module 1: How LLMs Actually Chunk Content

Module 1 of 9

What This Teaches

Why "sections" are not real chunks.

This module explains how LLMs actually break content into chunks, and why visual structure does not determine chunk boundaries.

Core Concepts

  • Token-based chunking: LLMs chunk based on token limits, not visual structure
  • DOM boundary heuristics: HTML structure influences but does not control chunking
  • Sentence vs semantic chunking: Chunks may split mid-sentence or combine multiple sentences
  • Overlap windows: Chunks overlap to preserve context, but overlap is not guaranteed
  • Truncation bias: overly long sentences or paragraphs can be cut at a chunk boundary, losing information
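
The first two concepts above can be sketched in a few lines. This is a minimal illustration, not a production chunker, and it assumes whitespace-separated words as a rough stand-in for tokens; real tokenizers (BPE, SentencePiece) split text differently.

```python
# Minimal sketch of token-based chunking with an overlap window.
# Assumption: whitespace-separated words approximate tokens.

def chunk_tokens(text, max_tokens=512, overlap=64):
    """Split text into windows of at most max_tokens words,
    each window overlapping the previous one by `overlap` words."""
    words = text.split()
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break
    return chunks

text = " ".join(f"word{i}" for i in range(1200))
chunks = chunk_tokens(text, max_tokens=512, overlap=64)
print(len(chunks))             # 3 windows for 1200 words
print(len(chunks[0].split()))  # 512 words in the first window
```

Note that the overlap is a property of this particular chunker's configuration, which is the point of the fourth bullet: a different pipeline may use a different overlap, or none at all.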

Key Truth

LLMs do not respect:

  • paragraphs
  • headings
  • visual sections

They chunk based on:

  • token limits
  • punctuation density
  • semantic similarity
  • transformer attention patterns
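
A quick way to see this in practice: feed structured text to a fixed-budget chunker and watch where the cut lands. The sketch below again uses word count as a stand-in for token count (an assumption), with a tiny budget so the boundary is visible.

```python
# Sketch: a fixed token budget cuts wherever the count runs out,
# regardless of headings or paragraph breaks.
# Assumption: word count approximates token count.

doc = (
    "Intro Heading\n"
    "First paragraph sentence one. First paragraph sentence two.\n\n"
    "Second Heading\n"
    "Second paragraph sentence one. Second paragraph sentence two."
)

words = doc.split()  # flattening discards all visual structure
budget = 8           # tiny budget so the cuts are easy to see
chunks = [" ".join(words[i:i + budget]) for i in range(0, len(words), budget)]

# The first cut lands mid-sentence; the second chunk straddles a heading.
for c in chunks:
    print(repr(c))
```

Nothing about the heading or the blank line survives the flattening step, which is exactly why visual sections are not real chunks.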

Practical Rule

If a fact requires previous text, it is already broken.

Design every chunk to be self-contained. Never assume surrounding context will be preserved.
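
One cheap way to operationalize this rule is to flag chunks that open with a dangling referent. The word lists below are illustrative assumptions, not a complete catalogue; a real check would need more linguistic care.

```python
# Heuristic sketch: flag chunks whose opening words refer back to text
# the chunk does not contain. The referent lists are illustrative
# assumptions, not a complete catalogue.

DANGLING_FIRST_WORDS = {"this", "that", "these", "those", "it", "they", "however"}
DANGLING_PHRASES = ("as mentioned", "as noted", "the above")

def needs_prior_context(chunk: str) -> bool:
    """Return True if the chunk likely depends on preceding text."""
    head = chunk.strip().lower()
    first_word = head.split()[0] if head.split() else ""
    return first_word in DANGLING_FIRST_WORDS or head.startswith(DANGLING_PHRASES)

print(needs_prior_context("This approach fails under load."))              # True
print(needs_prior_context("Caching reduces latency by reusing results."))  # False
```

A chunk that trips this check still needs a human read, but it is a fast first filter for "requires previous text".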

Optional Operator Task

Task: Select any existing webpage (yours or a competitor's). Copy three consecutive paragraphs. Split them into extraction-sized chunks as an LLM would, ignoring visual structure.

Constraint: Each chunk must be token-limited (approximately 512 tokens). Split at semantic boundaries, not sentence boundaries.

What success looks like: You produce a list of chunks where each chunk can be read in isolation without losing meaning. If a chunk requires previous context to be understood, you've identified a chunking failure.
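
The task above can be approximated mechanically. True semantic boundary detection needs an embedding model, which is out of scope here; this sketch packs whole sentences greedily under the token budget as a starting point, using words as a token proxy and a crude regex as the sentence splitter (both assumptions).

```python
# Sketch of the operator task: pack sentences greedily into chunks of
# at most max_tokens, never splitting mid-sentence.
# Assumptions: words approximate tokens; the regex is a crude sentence
# splitter, not a semantic segmenter.
import re

def pack_sentences(text, max_tokens=512):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, count = [], [], 0
    for s in sentences:
        n = len(s.split())
        if current and count + n > max_tokens:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(s)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

Run your three copied paragraphs through this, then read each chunk in isolation: any chunk that only makes sense after the previous one is the chunking failure the task asks you to find.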

This task is optional. No submission required. No validation. Use it to convert theory into applied thinking.