LLMs are exceptionally good at looping until they meet specific, verifiable goals. When you give vague instructions like “fix the authentication system,” you get vague results — the LLM makes changes, declares success, and leaves you to figure out whether anything actually improved. Goal-Driven Execution flips this: transform every task into a concrete success criterion, and let the LLM run until it can prove the goal is met.
The core transformation
Instead of telling the LLM what to do, give it a definition of done:
| Instead of… | Transform to… |
|---|---|
| "Add validation" | "Write tests for invalid inputs, then make them pass" |
| "Fix the bug" | "Write a test that reproduces it, then make it pass" |
| "Refactor X" | "Ensure tests pass before and after" |
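For example, the first transformation might become a handful of failing tests written before any implementation. The sketch below is hypothetical: validate_email and its module path are placeholders for whatever input validation the task actually concerns.

# Hypothetical "definition of done" for "Add validation": these tests are
# written first (and fail), and the task is complete when they all pass.
import pytest
from myapp.validation import validate_email  # hypothetical module and function

def test_rejects_empty_email():
    with pytest.raises(ValueError):
        validate_email("")

def test_rejects_email_without_at_sign():
    with pytest.raises(ValueError):
        validate_email("not-an-email")

def test_accepts_valid_email():
    assert validate_email("user@example.com") is True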
The multi-step plan format
For tasks with multiple steps, state a brief plan where each step has an explicit verification check:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
This format makes progress visible and gives the LLM clear stopping conditions for each phase.
Key insight
From Andrej Karpathy:
“LLMs are exceptionally good at looping until they meet specific goals… Don’t tell it what to do, give it success criteria and watch it go.”
Strong success criteria let the LLM loop independently. Weak criteria (“make it work”) require constant clarification.
Vague vs. verifiable example
Request: “Fix the authentication system”
What LLMs do wrong — vague plan with no success criteria
I'll fix the authentication system by:
1. Reviewing the code
2. Identifying issues
3. Making improvements
4. Testing the changes
[Proceeds to make changes without clear success criteria]
There’s no way to verify this is done, no way to confirm it solved the right problem, and no way to catch regressions.
The correct approach — define the specific issue and verify each step
To "fix authentication," I need to define success criteria.
What specific issue are we solving?
If the issue is "users stay logged in after password change":
Plan:
1. Write test: Change password → verify old session invalidated
Verify: Test fails (reproduces bug)
2. Implement: Invalidate sessions on password change
Verify: Test passes
3. Check edge cases: Multiple active sessions, concurrent changes
Verify: Additional tests pass
4. Verify no regression: Existing auth tests still pass
Verify: Full test suite green
Current test coverage for auth: [shows coverage]
What's the specific auth issue you're seeing?
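To make step 1 of that plan concrete, a reproducing test for "users stay logged in after password change" could look roughly like the sketch below. The client and user fixtures, endpoints, and status codes are hypothetical placeholders rather than any specific framework's API.

# Step 1 sketch: reproduce "old sessions survive a password change".
# The client/user fixtures and endpoints are illustrative placeholders.
def test_password_change_invalidates_existing_sessions(client, user):
    # Log in and confirm the session works
    old_token = client.login(user.email, "old-password")
    assert client.get("/profile", token=old_token).status_code == 200

    # Change the password using that session
    client.post("/change-password", token=old_token,
                json={"old": "old-password", "new": "new-password"})

    # The old session must now be rejected
    assert client.get("/profile", token=old_token).status_code == 401

# Verify: fails against current code (reproduces the bug), passes once
# sessions are invalidated on password change, and the existing auth
# suite stays green.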
Multi-step with verification example
Request: “Add rate limiting to the API”
What LLMs do wrong — implement everything at once with no verification steps
# Implements full rate limiting with Redis, multiple strategies,
# configuration system, and monitoring in one 300-line commit
# No clear verification steps
The correct approach — incremental steps, each independently verifiable
Plan for rate limiting:
1. Add basic in-memory rate limiting (single endpoint)
Verify:
- Test: 100 requests → first 10 succeed, rest get 429
- Manual: curl endpoint 11 times, see rate limit error
2. Extract to middleware (apply to all endpoints)
Verify:
- Test: Rate limits apply to /users and /posts
- Existing endpoint tests still pass
3. Add Redis backend (for multi-server)
Verify:
- Test: Rate limit persists across app restarts
- Test: Two app instances share rate limit counter
4. Add configuration (rates per endpoint)
Verify:
- Test: /search allows 10/min, /users allows 100/min
- Config file parsed correctly
Each step is independently verifiable and deployable.
Start with step 1?
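As an illustration of step 1 only, the sketch below uses a simple fixed-window, in-memory counter. The class name, limit, and window are illustrative, and translating a rejected request into an HTTP 429 would happen in the endpoint or middleware, not in the limiter itself.

# Step 1 sketch: in-memory rate limiting for a single endpoint.
import time

class InMemoryRateLimiter:
    def __init__(self, limit=10, window_seconds=60):
        self.limit = limit
        self.window = window_seconds
        self.hits = {}  # key -> timestamps of recent requests

    def allow(self, key):
        """Return True if this request fits within the current window's limit."""
        now = time.monotonic()
        recent = [t for t in self.hits.get(key, []) if now - t < self.window]
        allowed = len(recent) < self.limit
        if allowed:
            recent.append(now)
        self.hits[key] = recent
        return allowed

# Verification for step 1: first 10 requests succeed, the rest are rejected
# (the endpoint maps a rejection to a 429 response).
def test_first_ten_requests_allowed_then_limited():
    limiter = InMemoryRateLimiter(limit=10, window_seconds=60)
    results = [limiter.allow("client-1") for _ in range(100)]
    assert results[:10] == [True] * 10
    assert not any(results[10:])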
Test-first verification example
Request: “The sorting breaks when there are duplicate scores”
What LLMs do wrong — fix without reproducing the bug first
# Immediately changes sort logic without confirming the bug
def sort_scores(scores):
    return sorted(scores, key=lambda x: (-x['score'], x['name']))
This might fix the issue. It might not. There’s no way to know without a test that first demonstrates the broken behavior.
The correct approach — reproduce first, then fix
# 1. First, write a test that reproduces the issue
def test_sort_with_duplicate_scores():
    """Test sorting when multiple items have the same score."""
    scores = [
        {'name': 'Alice', 'score': 100},
        {'name': 'Bob', 'score': 100},
        {'name': 'Charlie', 'score': 90},
    ]
    result = sort_scores(scores)
    # The bug: order is non-deterministic for duplicates.
    # Asserting the tie-break order (by name) makes the inconsistency visible.
    assert [r['score'] for r in result] == [100, 100, 90]
    assert result[0]['name'] == 'Alice'
    assert result[1]['name'] == 'Bob'
# Verify: Run test 10 times → fails with inconsistent ordering

# 2. Now fix with a deterministic tie-break for equal scores
def sort_scores(scores):
    """Sort by score descending, then name ascending for ties."""
    return sorted(scores, key=lambda x: (-x['score'], x['name']))
# Verify: Test passes consistently
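The "passes consistently" check can itself be made executable. Assuming sort_scores is importable by the test (it is defined just above in this sketch), a small determinism test removes the need to rerun anything by hand:

# Optional: turn "consistent across runs" into a single assertion.
def test_sort_is_deterministic_for_duplicates():
    scores = [
        {'name': 'Alice', 'score': 100},
        {'name': 'Bob', 'score': 100},
    ]
    results = [sort_scores(scores) for _ in range(10)]
    # Every run must produce exactly the same ordering
    assert all(r == results[0] for r in results)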
When this principle is working
Strong success criteria produce a clear loop: write a failing test → implement → confirm it passes → confirm nothing else broke. When you see this loop running without constant clarification questions, Goal-Driven Execution is doing its job.
You’ll know it’s working when:
- The LLM asks for a specific failure condition before touching auth, search, or other complex systems
- Multi-step tasks arrive with explicit verify checkpoints at each step
- Bug fixes start with a reproducing test rather than a code change
- You can confirm a task is complete by running a command, not by reading the diff and guessing