Lever

Design your systems so the path of least resistance is also the path of correctness.

The four primitives

Practitioner heuristics for LLM-correct codebases, informed by the literature.

1. Derived obligations

The system computes correctness from the spec. The LLM writes the source; tools derive everything downstream. Every codegen step is a layer where wrong output is impossible.

Can the output be derived? If yes, codegen it.
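
In Go, the pattern can be as small as deriving a check from the spec instead of maintaining it by hand. A minimal sketch, assuming a hypothetical User spec; the struct and the `req` tag name are illustrative, not from the repo:

```go
package main

import (
	"fmt"
	"reflect"
)

// User is the spec: the `req` tag is the single source of truth
// for which fields are required. (Hypothetical example struct.)
type User struct {
	Name  string `req:"true"`
	Email string `req:"true"`
	Bio   string `req:"false"`
}

// MissingRequired derives the obligation from the spec at runtime,
// so there is no hand-maintained parallel list that could drift.
func MissingRequired(v any) []string {
	var missing []string
	rv := reflect.ValueOf(v)
	rt := rv.Type()
	for i := 0; i < rt.NumField(); i++ {
		f := rt.Field(i)
		if f.Tag.Get("req") == "true" && rv.Field(i).IsZero() {
			missing = append(missing, f.Name)
		}
	}
	return missing
}

func main() {
	// Email is required but empty, so it is reported as missing.
	fmt.Println(MissingRequired(User{Name: "a"}))
}
```

A full codegen step would emit this check at build time rather than reflect at runtime; the point is the same either way: the obligation is computed from the source, never written twice.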

2. Prescriptive failure

When the system rejects, it names what to do next. "call assert_node_order" is actionable; "unsatisfied requirement" requires interpretation. The error message is the prompt.

Does the error name the next action, or describe the problem?
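
A minimal Go sketch of the contrast; checkNodeOrder and sortNodes are hypothetical identifiers, not from the repo:

```go
package main

import "fmt"

// checkNodeOrder is a hypothetical verifier. On rejection it does not
// merely describe the problem ("unsatisfied requirement"); it names
// the location and the exact next action, so the error doubles as the
// repair prompt.
func checkNodeOrder(ids []int) error {
	for i := 1; i < len(ids); i++ {
		if ids[i] < ids[i-1] {
			return fmt.Errorf(
				"node %d out of order at index %d: call sortNodes(ids) before emit",
				ids[i], i)
		}
	}
	return nil
}

func main() {
	fmt.Println(checkNodeOrder([]int{1, 3, 2}))
}
```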

3. Bundled enforcement

You can't get the value without the verification. The API surface couples data with its obligations. If enforcement is opt-in, the agent will opt out.

Can the agent get the value without passing through verification?
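
The standard Go form of bundling is an unexported field behind a validating constructor: within the defining package the bundle can still be bypassed, so the guarantee holds at package boundaries. A sketch with a hypothetical Email type:

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// Email couples the value with its verification: the field is
// unexported, so code outside this package can only obtain an Email
// by passing through NewEmail. (Minimal sketch; real validation
// would be stricter than a substring check.)
type Email struct {
	addr string
}

func NewEmail(s string) (Email, error) {
	if !strings.Contains(s, "@") {
		return Email{}, errors.New(`invalid email: must contain "@"; fix the input, do not bypass NewEmail`)
	}
	return Email{addr: s}, nil
}

func (e Email) String() string { return e.addr }

func main() {
	if _, err := NewEmail("not-an-address"); err != nil {
		fmt.Println(err)
	}
	e, _ := NewEmail("dev@example.com")
	fmt.Println(e)
}
```

There is no opt-out path: any function that accepts an Email has already had its verification paid for.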

4. Vacuity detection

The system catches code that technically passes but verifies nothing. A predicate tested against a blank node. A return value that's never read. The test that asserts true.

Can the agent satisfy the checks without doing meaningful work?
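
One cheap guard for the blank-node case above is to run every predicate against the zero value and require rejection. A Go sketch; NotVacuous and Node are hypothetical names:

```go
package main

import "fmt"

type Node struct {
	ID    string
	Edges []string
}

// NotVacuous flags a predicate that also accepts the blank (zero)
// value: a check that passes on an empty node distinguishes nothing
// and therefore verifies nothing.
func NotVacuous[T any](pred func(T) bool) bool {
	var blank T
	return !pred(blank)
}

func main() {
	vacuous := func(n Node) bool { return true }    // passes everything
	real := func(n Node) bool { return n.ID != "" } // actually checks
	fmt.Println(NotVacuous(vacuous), NotVacuous(real))
}
```

The same idea generalizes: mutation testing and "assert the test can fail" harness checks are heavier-weight versions of the same question.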

Empirical basis

Numbers from the literature. Peer-reviewed sources marked with venue.

The repair loop problem

Where the errors are

What helps: Unambiguous external signals (Kamoi et al. 2024, TACL). Structured test feedback over human explanations (Dai et al. 2025). Scalar reward over verbal self-reflection (Song et al. 2025). Strategic restart over naive retry (Tang et al. 2024, NeurIPS).
What hurts: AGENTS.md context files reduce success rates while adding roughly 20% cost (Gloaguen et al. 2026, ETH Zurich). All eight frontier models tested degrade monotonically as context grows (Kumar 2026). Generating 20 independent attempts beats generating 10 and repairing each (Olausson et al. 2024, ICLR).

Our canary pilot

8 Python repair tasks, 4 feedback treatments, 4 models. On gpt-3.5-turbo (the only model that didn't hit a ceiling):

Treatment          Accuracy   Tokens
Brief + precise    87.5%       5,290
Brief + vague      75.0%       9,122
Verbose + precise  87.5%       4,184
Verbose + vague    87.5%      10,029

Precision matters more than brevity for accuracy. Brevity matters for token cost (47% savings). Caveat: n=1 per condition. This is a pilot, not a study.

The sycophancy connection

How RLHF training incentives create the failure modes that the four primitives address.

The mechanism: LLMs trained with RLHF optimize for "the user is happy." In code, "happy" means "tests pass." When the correct fix is hard and a surface patch is easy, the model patches. Pan et al. (ICML 2024) formalized this as in-context reward hacking: in feedback loops, LLMs optimize for the most recent signal at the expense of global correctness.

The evidence

Important qualification: These numbers are from adversarial scenarios. They demonstrate the failure mode exists and scales, not that it occurs at these rates on normal tasks.

The design implication: The alignment community frames this as a training problem. We frame it as a design problem. Make the path of least resistance also the path of correctness, so the training incentive works in your favor instead of against you. That's the whole thesis.

Stack coverage

Where each primitive applies, where the gaps are.

Layer           Derived             Prescriptive       Bundled              Vacuity
Database        Schema IS spec      Migration fails    Typed results        Golden file
Data access     sqlc from SQL       Expected sig       Typed returns        N/A (codegen)
Business logic  Gap                 Harder here        Signatures           Property tests
API             Huma from structs   OpenAPI contract   Validated input      Schema
Frontend        openapi-ts          TS compiler        Types enforce        Snapshots
Tests           Spec-derived reqs   Names the method   Data + enforcement   Blank-node test
CI              Pipeline config     Remediation msg    Won't pass without   LOC limits

The business logic gap: No codegen pipeline for "derive correct business logic from spec." This is where 83% of LLM errors live. Mitigations: exhaustive enum matching, strong domain types, property-based testing. Weaker than the primitives, but they narrow the reasoning surface.
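
The first two mitigations can be sketched together in Go: a strong domain type plus a switch that handles every variant and errors loudly on anything unhandled. OrderState and NextState are illustrative, not from the repo:

```go
package main

import "fmt"

// OrderState is a strong domain type: the compiler narrows what an
// agent can write, even without a codegen pipeline for business logic.
type OrderState int

const (
	Pending OrderState = iota
	Paid
	Shipped
)

// NextState names every variant explicitly and treats any unknown
// value as a prescriptive error rather than a silent fall-through —
// the closest plain Go gets to exhaustive matching.
func NextState(s OrderState) (OrderState, error) {
	switch s {
	case Pending:
		return Paid, nil
	case Paid:
		return Shipped, nil
	case Shipped:
		return Shipped, nil // terminal state
	default:
		return s, fmt.Errorf("unhandled OrderState %d: add a case to NextState", s)
	}
}

func main() {
	n, _ := NextState(Pending)
	fmt.Println(n == Paid)
}
```

Linters that enforce switch exhaustiveness, and property-based tests over the state machine, push this further without full codegen.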

The five questions

For each layer of your stack:

  1. Can the output be derived? If yes, codegen it.
  2. What does the error look like? Aim for: one line, what to do next.
  3. Can the agent get the value without passing through verification?
  4. Can the agent satisfy the checks without doing meaningful work?
  5. What happens on repeated failure? Restart after 2-3 attempts.

Limitations

Status: Early. Three of four primitives are demonstrated through the GitLab Knowledge Graph team's test framework. The fourth (vacuity detection) is grounded in formal methods literature. Go pattern examples and a canary bench are in the repo.