Case Studies

Three production repos, three real transformations.

I applied nv:context to three very different production codebases: a Python SDK with 4,612 tests, a multi-product AI SaaS, and a bilingual WhatsApp agent. Here's what changed in each, and what I learned.


01 · Case Study · Most detailed

selectools

Python SDK · 4,612 tests · L3 → L5–L6

440 → 67 CLAUDE.md lines (−85% reduction)

49 → 58 / 60 leverage score (+9 across 6 layers)

−53% always-loaded tokens (16.3K → 7.6K)

5 deterministic hooks (format, lint, push, compact, stale)

01 Baseline

One file trying to do everything.

The repo had a single 440-line CLAUDE.md doing the work of six files at once: project overview, directory tree, code conventions, testing patterns, feature checklists, release workflows, 26 common pitfalls, and the project roadmap.

ETH Zurich research shows LLMs reliably follow ~150–200 instructions before attention starts to fall off. At 440 lines, the agent was hitting diminishing returns on later instructions long before it ever read the code it was supposed to change.

CLAUDE.md: 440 lines · Subdirectory configs: 0 · Hooks: 0 · Maturity: L3

02 What changed

Progressive disclosure: 67-line root, scoped subdirectories, 5 hooks.

The 440-line file was rewritten as a 67-line root focused on commands, stability markers, and the 26 pitfalls condensed to one-liners. Everything else moved to the place an agent would actually need it.

Root CLAUDE.md: 440 → 67 lines

Removed the 80-line directory tree (agents discover this), the 50-line conventions block (moved to src/selectools/CLAUDE.md), the 50-line feature checklist (the /feature skill already covers this), the release history pattern (discoverable from git log), and the roadmap (already in ROADMAP.md). What stayed: 7 exact commands, the stability-markers table, the 27-StepType reference, and 26 pitfalls condensed to one line each.

3 scoped subdirectory CLAUDE.md files (140 lines total)

tests/CLAUDE.md (46 lines) for Agent setup, mock patterns and test gotchas. src/selectools/CLAUDE.md (51 lines) for code style, stability markers, the provider protocol, and source-only pitfalls. docs/CLAUDE.md (43 lines) for the MkDocs Material build and the 8-item feature documentation checklist.

Universal AGENTS.md (89 lines)

Three-tier boundaries (Always / Ask First / Never), 15 condensed landmines, the subagent fan-out and worktree patterns. AGENTS.md is the universal baseline read by 25+ tools (Cursor, Copilot, Windsurf, Aider, Gemini CLI), not just Claude.

5 hooks for the things that MUST happen 100% of the time

PostToolUse auto-formats every .py file with Black + isort. PreToolUse blocks git push to main/master. PreToolUse runs flake8 before any commit. PostCompact re-injects the top 30 lines of CLAUDE.md after context compression. SessionStart warns if configs are older than 14 days.

.claudeignore: 45 patterns

Excluded the 2,000-line landing/ directory, notebooks, build artifacts, and the docs CSS/JS assets. Without this, every read wasted context on files irrelevant to Python work.

CI coverage gate at 90%

The test command in .github/workflows/ci.yml changed from pytest -n auto to pytest -n auto --cov=selectools --cov-fail-under=90. The project sits at 95%. The gate is a floor, not a target.

HANDOFF.md template + /handoff skill

Five structured sections (What I Was Doing, Current State, What's Left, Key Decisions, Watch Out For), filled by a skill that reads git status and test status before /clear. Document-and-clear outperforms auto-compaction because human-curated handoffs preserve nuance.
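As a rough illustration of the template, the five sections can be rendered from a dictionary of notes. This is a minimal sketch, not the actual /handoff skill: `render_handoff` is a hypothetical helper, the Markdown heading layout is an assumption, and the real skill also reads git status and test status, which is left out here.

```python
# The five section titles listed in the HANDOFF.md template above.
HANDOFF_SECTIONS = (
    "What I Was Doing",
    "Current State",
    "What's Left",
    "Key Decisions",
    "Watch Out For",
)


def render_handoff(notes: dict[str, str]) -> str:
    """Render a HANDOFF.md body (hypothetical layout, one heading per section)."""
    parts = ["# HANDOFF"]
    for title in HANDOFF_SECTIONS:
        # Empty sections stay visible so a gap is obvious to the next session.
        body = notes.get(title, "").strip() or "(nothing recorded)"
        parts.append(f"## {title}\n\n{body}")
    return "\n\n".join(parts) + "\n"
```

Keeping the section order fixed means every handoff reads the same way, whichever session wrote it.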

selectools / progressive disclosure

// Always loaded for any task
Root CLAUDE.md (67 lines)        ← orientation + commands + 26 pitfalls
AGENTS.md      (89 lines)        ← universal boundaries + landmines

// Loaded only when working in that directory
tests/CLAUDE.md          (46)
src/selectools/CLAUDE.md (51)
docs/CLAUDE.md           (43)

// On demand
.claude/skills/   (10 skills, only when invoked by name)
.claude/settings.json  (5 hooks, always active, deterministic)

An agent editing src/selectools/agent/core.py now loads root (67) + AGENTS.md (89) + src/selectools/CLAUDE.md (51) = 207 lines of highly relevant context. Previously the same agent loaded 440 lines, 60% of which was irrelevant to the task.
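The arithmetic above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any particular agent resolves its context: `LINE_COUNTS` restates the figures from this case study, and `files_loaded_for` is a hypothetical helper that assumes the root files always load and that a scoped CLAUDE.md applies to every file at or below its directory.

```python
from pathlib import PurePosixPath

# Line counts reported in this case study (config file -> lines).
LINE_COUNTS = {
    "CLAUDE.md": 67,
    "AGENTS.md": 89,
    "tests/CLAUDE.md": 46,
    "src/selectools/CLAUDE.md": 51,
    "docs/CLAUDE.md": 43,
}

ALWAYS_LOADED = ("CLAUDE.md", "AGENTS.md")


def files_loaded_for(target: str) -> list[str]:
    """Return the config files in scope when editing `target` (assumed rules)."""
    loaded = list(ALWAYS_LOADED)
    # Walk from the repo root down to the target's own directory.
    for parent in reversed(PurePosixPath(target).parents):
        candidate = str(parent / "CLAUDE.md")
        if candidate in LINE_COUNTS and candidate not in loaded:
            loaded.append(candidate)
    return loaded


def lines_loaded_for(target: str) -> int:
    """Sum the line counts of every config file in scope for `target`."""
    return sum(LINE_COUNTS[name] for name in files_loaded_for(target))
```

Under these assumptions, `lines_loaded_for("src/selectools/agent/core.py")` gives 207, the figure quoted above, while a file at the repo root loads only the 156 always-loaded lines.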

03 Results

Token budget down 53%. Leverage score up 9 points.

The "always-loaded" budget dropped from ~16.3K tokens to ~7.6K. The full set of files an agent could load (when working in any directory) dropped from ~16.3K to ~10.4K. More effective context, fewer tokens.

Layer                    Before     After      Change
Verification             n/a        10 / 10    CI + 90% gate
CLAUDE.md / AGENTS.md    n/a        10 / 10    Progressive disclosure
Hooks                    0 / 10     10 / 10    +10
Skills                   n/a        10 / 10    10 skills
Subagent patterns        n/a        9 / 10     Documented
Session management       n/a        9 / 10     HANDOFF + .claudeignore
Overall                  49 / 60    58 / 60    +9

Always-loaded tokens: 16.3K → 7.6K · Max-loaded tokens: 16.3K → 10.4K · Maturity: L3 → L5–L6

04 Key insight

Loading fewer, more relevant lines beats loading everything.

"At 440 lines, every additional rule was making the file worse. At 67 lines, every line is a landmine or a command. Content the agent cannot discover by reading code."

selectools · final report

The agent gets 207 lines of highly relevant context per task instead of 440 lines of everything. The total context budget dropped 53%, and the agent makes fewer mistakes, not more. That's the entire thesis of context engineering in one repository.


02 · Case Study

nichevlabs

Multi-product SaaS · orchestrated by selectools · L4 → L6

805 → 59 SESSION.md lines (−93% reduction)

17 → 49 / 60 leverage score (+32 points)

15.8K tokens saved per session (on session start alone)

81 bugs found + fixed (in orchestration + control plane)

01 Baseline

L4 setup, 17 / 60 leverage. SESSION.md was eating the context window.

The repo had a 52-line CLAUDE.md and a clean directory layout, so the maturity level looked OK on paper. The reality: an 805-line SESSION.md was being loaded on every session start, costing about 17K tokens of historical build log, Steps 1–4 checklists, and a directory tree that agents could discover for free.

No AGENTS.md. No subdirectory CLAUDE.md files. No .claudeignore. No HANDOFF.md. No hooks at all. Tests existed (554 passing) but had no coverage floor, so a future change could silently drop coverage and nothing would catch it.

SESSION.md: 805 lines / ~17K tokens · AGENTS.md: none · Hooks: 0 · Subdirectory configs: 0 · Maturity: L4

02 What changed

Split SESSION.md, write AGENTS.md, scope by subdirectory, hook the rest.

The biggest single win was splitting SESSION.md. Then a fresh AGENTS.md with three-tier boundaries, six scoped subdirectory configs, and four hooks. Plus a bug-hunt subagent that found 81 real bugs while it was at it.

SESSION.md split: 805 → 59 lines (highest impact)

Current state stayed in the 59-line file. The full history moved to docs/SESSION_ARCHIVE.md. About 80% of the original was historical build log and a directory tree. Content agents never reference. Saved ~15,800 tokens per session start.

AGENTS.md created (120 lines, pruned from 126 after a bug hunt)

Three-tier boundaries (Always / Ask First / Never), 7 landmines (down from 10 once the bug hunt fixed 4 of them), and the universal commands. Pruned the obsolete landmines so agents wouldn't waste effort on non-issues.

6 scoped subdirectory CLAUDE.md files (23–39 lines each)

web/src/CLAUDE.md, api/src/platform/CLAUDE.md, api/tests/CLAUDE.md, api/.../agents/CLAUDE.md, api/.../products/CLAUDE.md, and supabase/CLAUDE.md. Each loads only when an agent is working inside that directory.

4 hooks: format, migration guard, PostCompact, pre-commit

PreToolUse blocks edits to migration files (100% enforcement, not 95%). PostToolUse auto-formats Python after every Write/Edit. PostCompact re-injects the current landmines. Pre-commit re-checks formatting before the commit lands.

CI coverage gate added (60% floor)

Added pytest-cov with --cov-fail-under=60 to the deploy.yml API job. 60% is a starting floor. The plan is to ratchet it up as the suite matures. The point is: a future change can no longer silently drop coverage.

learn-from-reviews GitHub Action (compounding engineering)

When a reviewer comments @claude-learn [rule] on a PR, the Action auto-creates a follow-up PR that adds that rule to AGENTS.md. The codebase learns from every review. Input is sanitized through env-var indirection to prevent script injection.
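The sanitizing step might look roughly like the sketch below. It is an assumption-laden illustration rather than the Action's real code: `extract_rule`, `append_rule`, the stripped character set, and the 200-character cap are all hypothetical. The idea is that the comment body reaches the script as plain data (for example through an environment variable) and is reduced to a single inert line before it touches AGENTS.md.

```python
import re
from typing import Optional

TRIGGER = "@claude-learn"
MAX_RULE_LENGTH = 200  # hypothetical cap, not taken from the original Action

# Characters with special meaning in shells or Markdown (assumed deny-list).
_UNSAFE = re.compile(r"[`$<>{}\\]")


def extract_rule(comment_body: str) -> Optional[str]:
    """Return a single-line, sanitized rule, or None if no trigger is present."""
    for line in comment_body.splitlines():
        stripped = line.strip()
        if stripped.startswith(TRIGGER):
            rule = _UNSAFE.sub("", stripped[len(TRIGGER):])
            rule = " ".join(rule.split())  # collapse runs of whitespace
            return rule[:MAX_RULE_LENGTH] or None
    return None


def append_rule(agents_md: str, rule: str) -> str:
    """Append the rule as a new bullet at the end of the AGENTS.md text."""
    return agents_md.rstrip("\n") + f"\n- {rule}\n"
```

In a workflow, the comment body would be read with something like `os.environ.get("COMMENT_BODY", "")` (the variable name is an assumption), so the untrusted text is never interpolated into a shell command.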

Continuous sync: .githooks/ultracontext-sync.sh

A non-blocking pre-commit warning that fires when package/CI/lint configs change, when agent configs are older than 14 days, or when soft negatives appear in any config file. Catches drift at commit time, before it becomes stale advice.
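Two of those triggers (watched configs changing and agent configs going stale) can be sketched as a pure function. This is not the contents of ultracontext-sync.sh: `drift_warnings` and the `WATCHED_PREFIXES` list are hypothetical, and the caller is assumed to supply the staged paths and last-modified timestamps.

```python
from datetime import datetime, timedelta

MAX_CONFIG_AGE = timedelta(days=14)

# Hypothetical examples of "package/CI/lint configs"; adjust per repo.
WATCHED_PREFIXES = ("pyproject.toml", "package.json", ".github/workflows/", ".flake8")


def drift_warnings(
    staged: list[str],
    config_mtimes: dict[str, datetime],
    now: datetime,
) -> list[str]:
    """Return human-readable warnings; an empty list means no drift detected."""
    warnings = []
    for path in staged:
        if path.startswith(WATCHED_PREFIXES):
            warnings.append(f"{path} changed: re-check agent configs")
    for path, modified in sorted(config_mtimes.items()):
        age = now - modified
        if age > MAX_CONFIG_AGE:
            warnings.append(f"{path} is {age.days} days old")
    return warnings
```

To stay non-blocking, a wrapper would print these warnings and still exit with status 0, so the commit always goes through.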

Negative-instruction scan: 4 soft negatives → 0

Re-scanned every config file for "don't" / "avoid" / "do not". Rewrote each one as a direct RFC 2119-style instruction (MUST, NEVER). Soft negative instructions increase the probability of the agent doing the wrong thing, because they draw attention to it.
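A scan like this is easy to approximate. The sketch below is one possible version, assuming the three phrases named above are the whole list; `find_soft_negatives` is a hypothetical helper, not the tool used in the audit.

```python
import re

# The soft negatives named above; both straight and curly apostrophes match.
SOFT_NEGATIVE = re.compile(r"\b(?:don['’]t|do not|avoid)\b", re.IGNORECASE)


def find_soft_negatives(text: str) -> list[tuple[int, str]]:
    """Return (line_number, line) pairs for lines containing a soft negative."""
    return [
        (number, line.strip())
        for number, line in enumerate(text.splitlines(), start=1)
        if SOFT_NEGATIVE.search(line)
    ]
```

Running it over each config file's text gives the line numbers to rewrite. It only finds candidates; the rewrite into a firm instruction is still a manual step.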

03 Results

17 / 60 to 49 / 60. All six layers improved.

The leverage score nearly tripled. Token budget sits at ~7K (about 5.5% of the 128K context window). Well under the 40% threshold where research shows precision starts to drop.

Layer                    Before     After      Change
Verification             5 / 10     8 / 10     +3
CLAUDE.md / AGENTS.md    3 / 10     9 / 10     +6
Hooks                    0 / 10     9 / 10     +9
Skills                   5 / 10     7 / 10     +2
Subagent patterns        4 / 10     7 / 10     +3
Session management       0 / 10     9 / 10     +9
Overall                  17 / 60    49 / 60    +32

Token budget: ~7K (5.5% of 128K) · Bugs found in bug hunt: 81 · Soft negatives: 4 → 0 · Maturity: L4 → L6

04 Key insight

SESSION.md is where context goes to die.

"Eighty percent of SESSION.md was historical build log and a directory tree. Agents never reference it. They reference the code. Splitting it saved fifteen thousand eight hundred tokens on every session start."

nichevlabs · final report

The corollary: a bug hunt run while I was writing AGENTS.md found 81 real bugs, and 4 of the 10 landmines I'd written turned out to describe bugs that had since been fixed. Stale landmines waste tokens AND confuse agents. A config file is only useful if you delete the lines that stop being true.


03 · Case Study

sheriff

Python + TypeScript · WhatsApp agent · L4 → L5

36 → 42 / 60 leverage score (+6 points)

10 files modified (2 created · 1 deleted)

~10% context window used (healthy, under the 40% threshold)

01 Baseline

An already-strong L4 setup. The work was incremental polish, not rewrite.

Sheriff already had a 31-line root CLAUDE.md, a 115-line AGENTS.md with proper RFC 2119 language, three scoped subdirectory CLAUDE.md files, an active whatsapp-reviewer agent with an 8-point checklist, and HANDOFF.md for session continuity. The original author had followed best practices.

The starting leverage was 36 / 60. Verification was the weakest layer at 5 / 10: tests existed, but there was no coverage gate, no type checking in CI, and the web tests weren't even being run. There was also a 222-line stale session.md from a previous brand name that duplicated HANDOFF.md's role with outdated information.

CLAUDE.md (root): 31 lines · AGENTS.md: 115 lines · Subdirectory configs: 3 · Hooks: 3 · Maturity: L4

02 What changed

Add the missing safety hooks. Add a coverage floor. Delete the stale stuff.

Twelve targeted edits across hooks, CI, and config hygiene. No rewrites. The principle: when a setup is already strong, the highest-leverage gaps are usually missing guardrails and stale artifacts, not new files.

PreToolUse hook: block git push to main

The single highest-impact missing guardrail. Prevents accidental pushes to production via the agent. 100% enforcement: the hook either runs or it doesn't, no probabilistic instruction-following involved.
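A guardrail of this kind might look like the sketch below. It is not Sheriff's actual hook. It assumes the hook receives a JSON payload on stdin with `tool_name` and `tool_input.command` fields, and that exit code 2 blocks the call; check both against the current hooks reference before relying on them. It also only catches pushes that name the branch explicitly: a bare `git push` from a checked-out main would need an extra current-branch check.

```python
import json
import re
import sys

# `git push ... main|master` within one shell command segment. The lookarounds
# keep names like `main-fix` or `feature/main` from matching.
_PUSH_TO_PROTECTED = re.compile(
    r"\bgit\s+push\b[^;&|]*?(?<![\w/-])(?:main|master)(?![\w/-])"
)


def should_block(command: str) -> bool:
    """True when the command pushes to an explicitly named protected branch."""
    return bool(_PUSH_TO_PROTECTED.search(command))


def decide(payload: dict) -> int:
    """Map a hook payload to an exit code: 2 blocks (assumed), 0 allows."""
    if payload.get("tool_name") != "Bash":
        return 0
    command = payload.get("tool_input", {}).get("command", "")
    return 2 if should_block(command) else 0


def main() -> None:
    """Entry point for the hook script: read the payload, exit with the verdict."""
    code = decide(json.load(sys.stdin))
    if code:
        print("Blocked: push to a protected branch.", file=sys.stderr)
    sys.exit(code)
```

Wiring it up means calling `main()` at the bottom of the script and registering the script as a PreToolUse hook. Keeping `should_block` pure makes the rule testable without an agent in the loop.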

Enhanced PostCompact hook

PostCompact now re-injects CLAUDE.md plus the full three-tier boundaries plus all current landmines after context compression. The original hook only re-injected the landmines, so agents would lose the boundaries after every compaction.

CI coverage gate at 40%, mypy on whatsapp.py, web tests in CI

Added --cov=src/cashcop --cov-fail-under=40 to pytest in deploy.yml, added a mypy step for whatsapp.py, and added pnpm test to the web job. Web tests existed but weren't actually being run in CI. That's the kind of thing you only catch when you do the audit.

Brand name cleanup

The project was rebranded from MyCashCop to Sheriff. Updated api/pyproject.toml from mycashcop-api to sheriff-api, and fixed the whatsapp-reviewer agent's description and body to match. User-facing names should be consistent, because agents will copy what they see.

Hardened .claudeignore

Added .env* (defense in depth for secrets), the stale session.md, docs/superpowers/, media files (.svg, .mp3, .wav, .mp4), and mobile build artifacts. None of this is content the agent should be spending tokens on.

Deleted the 222-line stale session.md

The file was MyCashCop-era. It duplicated HANDOFF.md's role with outdated information and could only confuse agents about the current state of the world. Some of the highest-leverage edits to a config setup are deletions.

Makefile: test-web + test-all targets

The Makefile had API test targets but no web test target despite pnpm test being available. Added the web target and a test-all target that runs both, then surfaced make test-all in AGENTS.md's commands section.

learn-from-reviews GitHub Action

Same compounding engineering pattern: when a reviewer comments @claude-learn [rule], the Action opens a PR that adds the sanitized rule to AGENTS.md. Input is run through env-var indirection to block script injection, so ${{ }} never appears inside a run: block.

03 Results

+6 points across the layers that mattered. ~7% token cost increase.

The token cost went up slightly (~300 tokens, +7%) because the enhanced PostCompact hook re-injects more content. That's a worthwhile trade for boundaries that survive compaction. Total context cost stayed at ~10% of the 200K window. Healthy by every metric in the research.

Layer                    Before     After      Change
Verification             5 / 10     7 / 10     +2
CLAUDE.md / AGENTS.md    7 / 10     8 / 10     +1
Hooks                    7 / 10     9 / 10     +2
Skills                   5 / 10     5 / 10     unchanged (solo dev)
Subagent patterns        5 / 10     5 / 10     unchanged (solo dev)
Session management       7 / 10     8 / 10     +1
Overall                  36 / 60    42 / 60    +6

Files modified: 10 · Files created: 2 · Files deleted: 1 · Maturity: L4 → L5

04 Key insight

When the setup is already strong, the leverage is in the gaps and the deletions.

"Can the agent discover this by reading the code? If yes, delete it. The highest-leverage gaps in a strong setup are usually missing safety hooks and stale artifacts, not new files."

sheriff · final report

The other lesson: the negative-instruction audit found exactly one soft negative across six files, in a HANDOFF.md template comment that wasn't even agent guidance. The original author had already followed RFC 2119 throughout. A good config setup is mostly a discipline of subtraction, and Sheriff's author had been practicing it from day one.

