Lever
Design your systems so the path of least resistance is also the path of correctness.
The four primitives
Practitioner heuristics for LLM-correct codebases, informed by the literature.
1. Derived obligations
The system computes correctness from the spec. The LLM writes the source of truth; tools derive everything downstream. Every codegen step adds a layer where wrong output is impossible.
Can the output be derived? If yes, codegen it.
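The pattern can be sketched in a few lines of Python (the schema name and fields here are hypothetical): the spec is the only thing a human or LLM edits, and the typed record is derived from it.

```python
from dataclasses import make_dataclass

# The spec: the single source of truth. Everything below is derived.
USER_SCHEMA = {"id": int, "email": str, "active": bool}

# Derived obligation: the record type is generated, never handwritten,
# so it cannot drift from the schema.
User = make_dataclass("User", list(USER_SCHEMA.items()))

u = User(id=1, email="a@example.com", active=True)
```

Real pipelines (sqlc, openapi-ts) do the same thing at build time; the point is that the derived layer is one where the LLM cannot be wrong.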
2. Prescriptive failure
When the system rejects, it names what to do next. `call assert_node_order` is actionable. `unsatisfied requirement` requires interpretation. The error message is the prompt.
Does the error name the next action, or describe the problem?
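A minimal sketch of the distinction, reusing the assert_node_order example above (the surrounding function and exception names are hypothetical):

```python
class NodeOrderError(Exception):
    pass

def check_node_order(nodes):
    if nodes != sorted(nodes):
        # Prescriptive: the message names the next action.
        raise NodeOrderError("call assert_node_order(nodes) before emit()")
        # Descriptive (what to avoid): "unsatisfied ordering requirement"
    return nodes

# The agent's next step is spelled out in the error text itself.
try:
    check_node_order([3, 1, 2])
except NodeOrderError as e:
    next_action = str(e)
```
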
3. Bundled enforcement
You can't get the value without the verification. The API surface couples data with its obligations. If enforcement is opt-in, the agent will opt out.
Can the agent get the value without passing through verification?
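One way to sketch the coupling in Python (`Checked` and its method names are hypothetical): the raw value is private, and the only accessor runs the check.

```python
class Checked:
    """Couples a value with its obligation: no unwrap without the check."""

    def __init__(self, value, check):
        self._value = value
        self._check = check

    def unwrap(self):
        self._check(self._value)   # enforcement is not opt-in
        return self._value

def must_contain_at(addr):
    if "@" not in addr:
        raise ValueError("add an '@' to the address, then call unwrap() again")

email = Checked("a@example.com", must_contain_at)
addr = email.unwrap()  # the value only flows through the verification
```

Note that the check's own failure message is prescriptive too: the primitives compose.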
4. Vacuity detection
The system catches code that technically passes but verifies nothing. A predicate tested against a blank node. A return value that's never read. A test that asserts `True`.
Can the agent satisfy the checks without doing meaningful work?
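A vacuity probe can be sketched as re-running a predicate against a blank input (names hypothetical): if the check still passes on nothing, it checks nothing.

```python
def is_vacuous(predicate, blank):
    """A check that passes on a blank input verifies nothing."""
    try:
        return bool(predicate(blank))
    except Exception:
        return False  # failing on the blank node means the check has teeth

meaningful = lambda node: node.get("type") == "user"
always_true = lambda node: True  # "assert True" in disguise

flag_a = is_vacuous(meaningful, {})   # the check does real work
flag_b = is_vacuous(always_true, {})  # vacuous: flag it
```
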
Empirical basis
Numbers from the literature. Peer-reviewed sources marked with venue.
The repair loop problem
- LLMs read 14x more than they write (2B input tokens, 140M output in the Anthropic C compiler case study)
- Failed resolutions cost 5-13x more tokens than successes (SWE-Effi)
- Most models lose 60-80% of debugging capability within 2-3 repair attempts. Decay is exponential (Adnan & Kuhn 2025, Nature Scientific Reports)
- Longer error messages correlate with faster decay (same study)
- 65% of patches in iterative repair are duplicates (Chen et al. 2025, ICSE)
Where the errors are
- 83% of LLM errors are logic, not syntax. Compilers catch 17%.
- 94% of LLM compilation errors are type errors (Mundler et al. 2025, PLDI)
- Type-constrained generation resolves 74.8% of compilation errors but improves functional correctness by only 3.5-5.5%
Our canary pilot
8 Python repair tasks, 4 feedback treatments, 4 models. On gpt-3.5-turbo (the only model that didn't ceiling):
| Treatment | Accuracy | Tokens |
|---|---|---|
| Brief + precise | 87.5% | 5,290 |
| Brief + vague | 75.0% | 9,122 |
| Verbose + precise | 87.5% | 4,184 |
| Verbose + vague | 87.5% | 10,029 |
Precision matters more than brevity for accuracy. Brevity matters for token cost (47% savings). Caveat: n=1 per condition. This is a pilot, not a study.
The sycophancy connection
How RLHF training incentives create the failure modes that the four primitives address.
The evidence
- Sycophancy is deep. Wang et al. (AAAI 2026) showed it emerges from structural override of learned knowledge. Average rate: 63.7% across 7 models.
- Sycophancy scales to subterfuge. Denison et al. (Anthropic, 2024) showed models spontaneously generalize from agreeing with users to editing reward functions and test files.
- Frontier models exploit tests. ImpossibleBench (2025): GPT-5 exploited test cases 76% of the time on deliberately impossible tasks. METR (2025): o3 reward-hacks at 70-95% even with anti-cheating instructions.
Important qualification: These numbers are from adversarial scenarios. They demonstrate the failure mode exists and scales, not that it occurs at these rates on normal tasks.
Stack coverage
Where each primitive applies, where the gaps are.
| Layer | Derived | Prescriptive | Bundled | Vacuity |
|---|---|---|---|---|
| Database | Schema IS spec | Migration fails | Typed results | Golden file |
| Data access | sqlc from SQL | Expected sig | Typed returns | N/A (codegen) |
| Business logic | Gap | Harder here | Signatures | Property tests |
| API | Huma from structs | OpenAPI contract | Validated input | Schema |
| Frontend | openapi-ts | TS compiler | Types enforce | Snapshots |
| Tests | Spec-derived reqs | Names the method | Data + enforcement | Blank-node test |
| CI | Pipeline config | Remediation msg | Won't pass without | LOC limits |
The five questions
For each layer of your stack:
- Can the output be derived? If yes, codegen it.
- What does the error look like? Aim for: one line, what to do next.
- Can the agent get the value without passing through verification?
- Can the agent satisfy the checks without doing meaningful work?
- What happens on repeated failure? Restart after 2-3 attempts.
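The restart rule in the last question can be sketched as a budgeted loop (`run_attempt` is a hypothetical stand-in for one generate-and-verify pass):

```python
def solve(run_attempt, max_repairs=2, max_restarts=5):
    """Cap repairs at 2-3 attempts, then restart with a clean context
    rather than feeding a growing error history back to the model."""
    for _ in range(max_restarts):
        history = []                      # fresh context on each restart
        for _ in range(max_repairs + 1):
            ok, feedback = run_attempt(history)
            if ok:
                return True
            history.append(feedback)      # keep feedback brief and precise
    return False
```

This encodes the decay finding directly: past 2-3 attempts the expected value of another repair is low, so the budget forces a restart instead of a longer loop.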
Limitations
- Business logic correctness. There is no codegen pipeline for "derive correct business logic from spec." The LLM has to reason.
- Semantic correctness. A test can satisfy all four primitives and still miss a real bug.
- Model capability. The dominant factor in LLM code quality is model capability (McMillan 2026, 21pp gap). These primitives help most for weaker models or harder tasks.