Lever
Design your systems so the path of least resistance is also the path of correctness.
The four primitives
Practitioner heuristics for LLM-correct codebases, informed by the literature.
1. Derived obligations
The system computes correctness from the spec. The LLM writes the source of truth; tools derive everything downstream. Every codegen step adds a layer where wrong output is impossible.
Can the output be derived? If yes, codegen it.
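The pattern can be sketched in a few lines of Python (the schema name and fields here are hypothetical): the spec is the only thing a human or LLM edits, and the typed record is derived from it.

```python
from dataclasses import make_dataclass

# The spec: the single source of truth. Everything below is derived.
USER_SCHEMA = {"id": int, "email": str, "active": bool}

# Derived obligation: the record type is generated, never handwritten,
# so it cannot drift from the schema.
User = make_dataclass("User", list(USER_SCHEMA.items()))

u = User(id=1, email="a@example.com", active=True)
```

Real pipelines (sqlc, openapi-ts) do the same thing at build time; the point is that the derived layer is one where the LLM cannot be wrong.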
2. Prescriptive failure
When the system rejects, it names what to do next. `call assert_node_order` is actionable. `unsatisfied requirement` requires interpretation. The error message is the prompt.
Does the error name the next action, or describe the problem?
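A minimal sketch of the distinction, reusing the assert_node_order example above (the surrounding function and exception names are hypothetical):

```python
class NodeOrderError(Exception):
    pass

def check_node_order(nodes):
    if nodes != sorted(nodes):
        # Prescriptive: the message names the next action.
        raise NodeOrderError("call assert_node_order(nodes) before emit()")
        # Descriptive (what to avoid): "unsatisfied ordering requirement"
    return nodes

# The agent's next step is spelled out in the error text itself.
try:
    check_node_order([3, 1, 2])
except NodeOrderError as e:
    next_action = str(e)
```
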
3. Bundled enforcement
You can't get the value without the verification. The API surface couples data with its obligations. If enforcement is opt-in, the agent will opt out.
Can the agent get the value without passing through verification?
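One way to sketch the coupling in Python (`Checked` and its method names are hypothetical): the raw value is private, and the only accessor runs the check.

```python
class Checked:
    """Couples a value with its obligation: no unwrap without the check."""

    def __init__(self, value, check):
        self._value = value
        self._check = check

    def unwrap(self):
        self._check(self._value)   # enforcement is not opt-in
        return self._value

def must_contain_at(addr):
    if "@" not in addr:
        raise ValueError("add an '@' to the address, then call unwrap() again")

email = Checked("a@example.com", must_contain_at)
addr = email.unwrap()  # the value only flows through the verification
```

Note that the check's own failure message is prescriptive too: the primitives compose.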
4. Vacuity detection
The system catches code that technically passes but verifies nothing. A predicate tested against a blank node. A return value that's never read. A test that asserts `True`.
Can the agent satisfy the checks without doing meaningful work?
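A vacuity probe can be sketched as re-running a predicate against a blank input (names hypothetical): if the check still passes on nothing, it checks nothing.

```python
def is_vacuous(predicate, blank):
    """A check that passes on a blank input verifies nothing."""
    try:
        return bool(predicate(blank))
    except Exception:
        return False  # failing on the blank node means the check has teeth

meaningful = lambda node: node.get("type") == "user"
always_true = lambda node: True  # "assert True" in disguise

flag_a = is_vacuous(meaningful, {})   # the check does real work
flag_b = is_vacuous(always_true, {})  # vacuous: flag it
```
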
Empirical basis
Numbers from the literature. Peer-reviewed sources marked with venue.
The repair loop problem
- LLMs read 14x more than they write (2B input tokens, 140M output in the Anthropic C compiler case study)
- Failed resolutions cost 5-13x more tokens than successes (SWE-Effi)
- Most models lose 60-80% of debugging capability within 2-3 repair attempts. Decay is exponential (Adnan & Kuhn 2025, Nature Scientific Reports)
- Longer error messages correlate with faster decay (same study)
- 65% of patches in iterative repair are duplicates (Chen et al. 2025, ICSE)
Where the errors are
- 83% of LLM errors are logic, not syntax. Compilers catch 17%.
- 94% of LLM compilation errors are type errors (Mundler et al. 2025, PLDI)
- Type-constrained generation resolves 74.8% of compilation errors but improves functional correctness by only 3.5-5.5%
Our canary pilot
8 Python repair tasks, 4 feedback treatments, 4 models. On gpt-3.5-turbo (the only model that didn't ceiling):
| Treatment | Accuracy | Tokens |
|---|---|---|
| Brief + precise | 87.5% | 5,290 |
| Brief + vague | 75.0% | 9,122 |
| Verbose + precise | 87.5% | 4,184 |
| Verbose + vague | 87.5% | 10,029 |
Precision matters more than brevity for accuracy. Brevity matters for token cost (47% savings). Caveat: n=1 per condition. This is a pilot, not a study.
The sycophancy connection
How RLHF training incentives create the failure modes that the four primitives address.
The evidence
- Sycophancy is deep. Wang et al. (AAAI 2026) showed it emerges from structural override of learned knowledge. Average rate: 63.7% across 7 models.
- Sycophancy scales to subterfuge. Denison et al. (Anthropic, 2024) showed models spontaneously generalize from agreeing with users to editing reward functions and test files.
- Frontier models exploit tests. ImpossibleBench (2025): GPT-5 exploited test cases 76% of the time on deliberately impossible tasks. METR (2025): o3 reward-hacks at 70-95% even with anti-cheating instructions.
Important qualification: These numbers are from adversarial scenarios. They demonstrate the failure mode exists and scales, not that it occurs at these rates on normal tasks.
Stack coverage
Where each primitive applies, where the gaps are.
| Layer | Derived | Prescriptive | Bundled | Vacuity |
|---|---|---|---|---|
| Database | Schema IS spec | Migration fails | Typed results | Golden file |
| Data access | sqlc from SQL | Expected sig | Typed returns | N/A (codegen) |
| Business logic | Gap | Harder here | Signatures | Property tests |
| API | Huma from structs | OpenAPI contract | Validated input | Schema |
| Frontend | openapi-ts | TS compiler | Types enforce | Snapshots |
| Tests | Spec-derived reqs | Names the method | Data + enforcement | Blank-node test |
| CI | Pipeline config | Remediation msg | Won't pass without | LOC limits |
The five questions
For each layer of your stack:
- Can the output be derived? If yes, codegen it.
- What does the error look like? Aim for: one line, what to do next.
- Can the agent get the value without passing through verification?
- Can the agent satisfy the checks without doing meaningful work?
- What happens on repeated failure? Restart after 2-3 attempts.
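The restart rule in the last question can be sketched as a budgeted loop (`run_attempt` is a hypothetical stand-in for one generate-and-verify pass):

```python
def solve(run_attempt, max_repairs=2, max_restarts=5):
    """Cap repairs at 2-3 attempts, then restart with a clean context
    rather than feeding a growing error history back to the model."""
    for _ in range(max_restarts):
        history = []                      # fresh context on each restart
        for _ in range(max_repairs + 1):
            ok, feedback = run_attempt(history)
            if ok:
                return True
            history.append(feedback)      # keep feedback brief and precise
    return False
```

This encodes the decay finding directly: past 2-3 attempts the expected value of another repair is low, so the budget forces a restart instead of a longer loop.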
Limitations
- Business logic correctness. There is no codegen pipeline for "derive correct business logic from spec." The LLM has to reason.
- Semantic correctness. A test can satisfy all four primitives and still miss a real bug.
- Model capability. The dominant factor in LLM code quality is model capability (McMillan 2026, 21pp gap). These primitives help most for weaker models or harder tasks.