Module 2: Chunk Atomicity and Inference Cost

Module 2 of 9

What This Teaches

Why multi-fact chunks fail retrieval.

This module explains the data engineering principle: each chunk is a row, and each row should answer one question.

Related: Crouton Specification

Data Engineering Framing

Each chunk is a row.

Each row should answer one question.

This is not content writing. This is schema design for retrieval systems.
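The row framing can be sketched as a small schema. This is a minimal illustration, not a defined format: the `ChunkRow` class, its field names, and the sample rows are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChunkRow:
    """One retrieval row (hypothetical schema): one assertion, one question."""
    chunk_id: str
    question: str   # the single question this row answers
    assertion: str  # one self-contained declarative sentence


# Sample rows written for illustration only.
rows = [
    ChunkRow("c1", "What does one chunk represent?",
             "Each chunk is one row in the retrieval index."),
    ChunkRow("c2", "How many questions should one row answer?",
             "Each row in the retrieval index should answer one question."),
]

# If every row answers exactly one distinct question, the question
# column behaves like a unique key over the rows.
index_by_question = {row.question: row for row in rows}
```

If two rows collapse to the same key, or one row needs two questions to describe it, that is a sign the chunk is not yet atomic.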

Inference Cost Principle

A chunk adds inference work when the LLM must:

  • resolve pronouns
  • infer scope
  • merge facts
  • assume context

Each of these steps increases:

  • compute cost
  • hallucination risk
  • citation avoidance

LLMs tend to avoid citing content that requires inference. They favor explicit, self-contained facts.

Prechunking Rule

One chunk = one assertion = one retrieval target.

If a chunk contains multiple facts, split it. If a chunk requires context, make the context explicit within the chunk.
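The rule above can be sketched in code. This is a rough heuristic under stated assumptions: the helper names `split_into_assertions` and `make_context_explicit` are hypothetical, the sentence splitter is naive (it breaks on any period, question mark, or exclamation mark followed by whitespace), and the product in the sample chunk is invented.

```python
import re
from typing import List


def split_into_assertions(chunk: str) -> List[str]:
    """Naively split a chunk into one-sentence candidates."""
    parts = re.split(r"(?<=[.!?])\s+", chunk.strip())
    return [p for p in parts if p]


def make_context_explicit(sentence: str, subject: str) -> str:
    """Replace a leading 'It', 'They', or 'This' with the named subject.

    Assumption: the caller already knows which entity the pronoun refers to.
    """
    return re.sub(r"^(It|They|This)\b", subject, sentence)


# Invented example: one chunk holding two facts, the second relying on context.
chunk = "Acme Router X2 supports Wi-Fi 6. It ships with a two-year warranty."

atomic_chunks = [
    make_context_explicit(sentence, "Acme Router X2")
    for sentence in split_into_assertions(chunk)
]
```

Each resulting chunk names its subject, so neither depends on the other to be understood. Real text needs more care than this (abbreviations, mid-sentence pronouns), so treat the output as candidates to review by hand.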

Optional Operator Task

Task: Take a narrative paragraph from any existing page. Extract exactly five atomic facts (croutons) from it. Each fact must be a single declarative sentence.

Constraint: No fact may contain pronouns, conjunctions, or implied context. Each fact must explicitly name all entities and relationships.

What success looks like: You produce five sentences where each sentence can be read in isolation without ambiguity. If a sentence requires the previous sentence to be understood, it's not atomic.
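A rough self-check for the constraint can be sketched as a word-list scan. The function name `atomicity_violations` and both word lists are assumptions made for illustration. The lists are incomplete, and the scan cannot detect implied context, so a clean result does not prove a sentence is atomic.

```python
import re
from typing import List

# Illustrative, incomplete word lists (assumptions, not a standard).
PRONOUNS = {"it", "its", "they", "them", "their", "this", "that",
            "these", "those", "he", "she", "we", "you"}
CONJUNCTIONS = {"and", "or", "but", "because", "so", "while", "although"}


def atomicity_violations(fact: str) -> List[str]:
    """Return the flagged pronouns and conjunctions found in a fact, sorted."""
    words = re.findall(r"[a-z']+", fact.lower())
    flagged = PRONOUNS | CONJUNCTIONS
    return sorted({word for word in words if word in flagged})


# Invented sample sentences.
print(atomicity_violations("It ships with a warranty and a cable."))
print(atomicity_violations("Acme Router X2 supports Wi-Fi 6."))
```

The first sample is flagged for "it" and "and"; the second returns an empty list. Use the output as a prompt to reread a sentence, not as validation.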

This task is optional. No submission required. No validation. Use it to convert theory into applied thinking.