Module 2: Chunk Atomicity and Inference Cost
Module 2 of 9
What This Teaches
Why multi-fact chunks fail retrieval.
This module explains the data engineering principle: each chunk is a row, and each row should answer one question.
Related: Crouton Specification
Data Engineering Framing
Each chunk is a row.
Each row should answer one question.
This is not content writing. This is schema design for retrieval systems.
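The row framing can be sketched as a small typed record. This is a minimal illustration, not a schema defined by this module: the `ChunkRow` class, its field names, the `answer` helper, and the "Acme Corp" data are all invented for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ChunkRow:
    """One retrieval row: a single question paired with a single answer."""
    chunk_id: str   # hypothetical identifier, not defined by the module
    question: str   # the one question this row answers
    assertion: str  # the one self-contained sentence that answers it


# Invented example data: each row answers exactly one question.
rows = [
    ChunkRow("acme-001", "What year was Acme Corp founded?",
             "Acme Corp was founded in 1999."),
    ChunkRow("acme-002", "Where is Acme Corp headquartered?",
             "Acme Corp is headquartered in Austin, Texas."),
]


def answer(question: str, table: List[ChunkRow]) -> Optional[str]:
    """Return the assertion of the row that answers the question, if any."""
    for row in table:
        if row.question == question:
            return row.assertion
    return None
```

Exact-match lookup stands in for a real retriever here. The point is only that a row carrying one question and one assertion can be returned as-is, with nothing left to merge or infer.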
Inference Cost Principle
If an LLM must do any of the following to use a chunk:
- resolve pronouns
- infer scope
- merge facts
- assume context
then each extra inference step increases:
- compute cost
- hallucination risk
- the likelihood that the chunk goes uncited
LLMs tend to avoid citing content that requires inference. They prefer explicit, self-contained facts.
Prechunking Rule
One chunk = one assertion = one retrieval target.
If a chunk contains multiple facts, split it. If a chunk requires context, make the context explicit within the chunk.
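A rough sketch of the rule in practice, assuming a naive regex sentence splitter is good enough for a first pass. The function name `split_into_candidates` and the sample text are hypothetical. The splitter only separates sentences, so any context it leaves implicit still has to be written in by hand.

```python
import re
from typing import List


def split_into_candidates(chunk: str) -> List[str]:
    """Naively split a chunk into one-sentence candidate chunks.

    Splits on whitespace that follows '.', '!' or '?'. This is a heuristic,
    so abbreviations such as 'e.g.' will be split incorrectly.
    """
    parts = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return [part for part in parts if part]


# Invented example: one chunk carrying two facts.
multi_fact = "Acme Corp was founded in 1999. It is headquartered in Austin, Texas."

candidates = split_into_candidates(multi_fact)

# The second candidate still opens with "It", so it depends on the first.
# Making the context explicit is a manual rewrite, shown here as a replace.
explicit = [candidates[0], candidates[1].replace("It", "Acme Corp", 1)]
```

After the rewrite, each entry in `explicit` names its subject and can serve as its own retrieval target.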
Optional Operator Task
Task: Take a narrative paragraph from any existing page. Extract exactly five atomic facts (croutons) from it. Each fact must be a single declarative sentence.
Constraint: No fact may contain pronouns, conjunctions, or implied context. Each fact must explicitly name all entities and relationships.
What success looks like: You produce five sentences where each sentence can be read in isolation without ambiguity. If a sentence requires the previous sentence to be understood, it's not atomic.
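For a quick self-check, the constraints above can be approximated with a word-list heuristic. This is only a sketch: the `PRONOUNS` and `CONJUNCTIONS` sets are short, non-exhaustive assumptions, the function names are hypothetical, and a clean result does not prove a sentence is free of implied context.

```python
import re
from typing import List

# Assumed, non-exhaustive word lists; extend them for real use.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them",
            "his", "its", "their", "this", "these", "those"}
CONJUNCTIONS = {"and", "or", "but", "so", "yet", "nor",
                "because", "although", "while"}


def check_fact(sentence: str) -> List[str]:
    """Return a list of problems found in one candidate fact."""
    problems = []
    words = set(re.findall(r"[a-z']+", sentence.lower()))
    pronouns = sorted(words & PRONOUNS)
    conjunctions = sorted(words & CONJUNCTIONS)
    if pronouns:
        problems.append("pronoun: " + ", ".join(pronouns))
    if conjunctions:
        problems.append("conjunction: " + ", ".join(conjunctions))
    # Exactly one sentence terminator is expected in a single sentence.
    if len(re.findall(r"[.!?](?:\s|$)", sentence.strip())) != 1:
        problems.append("not a single sentence")
    return problems


def check_task(facts: List[str]) -> bool:
    """True only if there are exactly five facts and none has a problem."""
    return len(facts) == 5 and all(not check_fact(fact) for fact in facts)
```

Calling `check_fact` on "It was founded in 1999 and is based in Austin." reports both a pronoun and a conjunction, while "Acme Corp was founded in 1999." comes back with no problems. Reading each sentence in isolation remains the real test.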
This task is optional. No submission required. No validation. Use it to convert theory into applied thinking.