
Documentation Index

Fetch the complete documentation index at: https://docs.karpathy.phronesisagent.com/llms.txt

Use this file to discover all available pages before exploring further.

LLMs are exceptionally good at looping until they meet specific, verifiable goals. When you give vague instructions like “fix the authentication system,” you get vague results — the LLM makes changes, declares success, and leaves you to figure out whether anything actually improved. Goal-Driven Execution flips this: transform every task into a concrete success criterion, and let the LLM run until it can prove the goal is met.

The core transformation

Instead of telling the LLM what to do, give it a definition of done:
Instead of… → Transform to…
"Add validation" → "Write tests for invalid inputs, then make them pass"
"Fix the bug" → "Write a test that reproduces it, then make it pass"
"Refactor X" → "Ensure tests pass before and after"

The multi-step plan format

For tasks with multiple steps, state a brief plan where each step has an explicit verification check:
1. [Step] → verify: [check]
2. [Step] → verify: [check]
3. [Step] → verify: [check]
This format makes progress visible and gives the LLM clear stopping conditions for each phase.

Key insight

From Andrej Karpathy:
“LLMs are exceptionally good at looping until they meet specific goals… Don’t tell it what to do, give it success criteria and watch it go.”
Strong success criteria let the LLM loop independently. Weak criteria (“make it work”) require constant clarification.

Vague vs. verifiable example

Request: “Fix the authentication system”
What LLMs do wrong — vague plan with no success criteria
I'll fix the authentication system by:
1. Reviewing the code
2. Identifying issues
3. Making improvements
4. Testing the changes

[Proceeds to make changes without clear success criteria]
There’s no way to verify this is done, no way to confirm it solved the right problem, and no way to catch regressions.
The correct approach — define the specific issue and verify each step
To "fix authentication," I need to define success criteria.
What specific issue are we solving?

If the issue is "users stay logged in after password change":

Plan:
1. Write test: Change password → verify old session invalidated
   Verify: Test fails (reproduces bug)

2. Implement: Invalidate sessions on password change
   Verify: Test passes

3. Check edge cases: Multiple active sessions, concurrent changes
   Verify: Additional tests pass

4. Verify no regression: Existing auth tests still pass
   Verify: Full test suite green

Current test coverage for auth: [shows coverage]
What's the specific auth issue you're seeing?
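Step 1 of that plan can be sketched as a failing test against a minimal in-memory stand-in. Everything here is hypothetical: `AuthService`, `create_session`, `is_valid`, and `change_password` are invented names standing in for whatever your real auth layer provides, not part of any framework.

```python
class AuthService:
    """Minimal in-memory stand-in for a real auth layer (hypothetical API)."""

    def __init__(self):
        self.passwords = {}   # user -> password
        self.sessions = {}    # token -> user
        self._counter = 0

    def create_session(self, user):
        self._counter += 1
        token = f"tok-{self._counter}"
        self.sessions[token] = user
        return token

    def is_valid(self, token):
        return token in self.sessions

    def change_password(self, user, new_password):
        self.passwords[user] = new_password
        # The fix under test: drop every session belonging to this user.
        # Before this line exists, the test below reproduces the bug.
        self.sessions = {t: u for t, u in self.sessions.items() if u != user}


def test_password_change_invalidates_old_sessions():
    auth = AuthService()
    token = auth.create_session("alice")
    assert auth.is_valid(token)

    auth.change_password("alice", "new-secret")
    assert not auth.is_valid(token)  # fails until sessions are invalidated


test_password_change_invalidates_old_sessions()
```

The point of the stub is the verification loop, not the implementation: run the test first against the buggy behavior (it should fail), then add the invalidation line and watch it pass.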

Multi-step with verification example

Request: “Add rate limiting to the API”
What LLMs do wrong — implement everything at once with no verification steps
# Implements full rate limiting with Redis, multiple strategies,
# configuration system, and monitoring in one 300-line commit
# No clear verification steps
The correct approach — incremental steps, each independently verifiable
Plan for rate limiting:

1. Add basic in-memory rate limiting (single endpoint)
   Verify:
   - Test: 100 requests → first 10 succeed, rest get 429
   - Manual: curl endpoint 11 times, see rate limit error

2. Extract to middleware (apply to all endpoints)
   Verify:
   - Test: Rate limits apply to /users and /posts
   - Existing endpoint tests still pass

3. Add Redis backend (for multi-server)
   Verify:
   - Test: Rate limit persists across app restarts
   - Test: Two app instances share rate limit counter

4. Add configuration (rates per endpoint)
   Verify:
   - Test: /search allows 10/min, /users allows 100/min
   - Config file parsed correctly

Each step is independently verifiable and deployable.
Start with step 1?
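Step 1 of that plan might look like the sketch below: a sliding-window limiter kept in process memory, small enough to verify with a single test. The `RateLimiter` class and its `allow` method are illustrative names, not a real library API.

```python
import time
from collections import defaultdict, deque


class RateLimiter:
    """Allow at most `limit` requests per `window` seconds, per key (sketch)."""

    def __init__(self, limit=10, window=60.0):
        self.limit = limit
        self.window = window
        self.hits = defaultdict(deque)  # key -> timestamps of recent requests

    def allow(self, key):
        now = time.monotonic()
        q = self.hits[key]
        # Drop timestamps that have aged out of the window.
        while q and now - q[0] > self.window:
            q.popleft()
        if len(q) >= self.limit:
            return False  # caller should respond with HTTP 429
        q.append(now)
        return True


# Verification for step 1: 11 requests -> first 10 succeed, 11th is rejected.
limiter = RateLimiter(limit=10, window=60.0)
results = [limiter.allow("client-1") for _ in range(11)]
print(results.count(True), results[-1])  # 10 False
```

In-memory state is exactly what makes this step independently deployable: it works on one server with no new infrastructure, and the Redis backend in step 3 only has to replace the `hits` storage.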

Test-first verification example

Request: “The sorting breaks when there are duplicate scores”
What LLMs do wrong — fix without reproducing the bug first
# Immediately changes sort logic without confirming the bug
def sort_scores(scores):
    return sorted(scores, key=lambda x: (-x['score'], x['name']))
This might fix the issue. It might not. There’s no way to know without a test that first demonstrates the broken behavior.
The correct approach — reproduce first, then fix
# 1. First, write a test that reproduces the issue
def test_sort_with_duplicate_scores():
    """Test sorting when multiple items have same score."""
    scores = [
        {'name': 'Alice', 'score': 100},
        {'name': 'Bob', 'score': 100},
        {'name': 'Charlie', 'score': 90},
    ]

    result = sort_scores(scores)

    # The bug: order is non-deterministic for duplicates.
    # Pin the expected tie-break (name ascending) so the test can actually fail:
    assert [s['name'] for s in result] == ['Alice', 'Bob', 'Charlie']
    assert [s['score'] for s in result] == [100, 100, 90]

# Verify: Run test 10 times → fails with inconsistent ordering

# 2. Now fix with stable sort
def sort_scores(scores):
    """Sort by score descending, then name ascending for ties."""
    return sorted(scores, key=lambda x: (-x['score'], x['name']))

# Verify: Test passes consistently
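A quick check of the fixed version shows why the compound key makes tie-breaking deterministic: negating the score sorts descending, and the name breaks ties ascending. The sample data is invented for illustration.

```python
def sort_scores(scores):
    """Sort by score descending, then name ascending for ties."""
    return sorted(scores, key=lambda x: (-x['score'], x['name']))


scores = [
    {'name': 'Bob', 'score': 100},
    {'name': 'Alice', 'score': 100},
    {'name': 'Charlie', 'score': 90},
]
result = sort_scores(scores)
print([s['name'] for s in result])  # ['Alice', 'Bob', 'Charlie'] on every run
```

Note that Bob appears before Alice in the input but after her in the output: the ordering now depends only on the key, not on whatever order the data happened to arrive in.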

When this principle is working

Strong success criteria produce a clear loop: write a failing test → implement → confirm it passes → confirm nothing else broke. When you see this loop running without constant clarification questions, Goal-Driven Execution is doing its job.
You’ll know it’s working when:
  • The LLM asks for a specific failure condition before touching auth, search, or other complex systems
  • Multi-step tasks arrive with explicit verify checkpoints at each step
  • Bug fixes start with a reproducing test rather than a code change
  • You can confirm a task is complete by running a command, not by reading the diff and guessing