The Problem
Your agent generates code. It looks good. You run it - syntax error on line 4. The agent confidently hallucinated a library function that doesn't exist. If only it had checked its own work before declaring success.
Agents make mistakes. But they can also catch them - if you teach them to look.
Most developers build agents that generate and move on. No self-checking, no verification, no "wait, does this even make sense?" The reflection pattern adds that critical second look.
The Core Insight
Reflection is critique as a system component, not an afterthought.
Think of it like code review: you don't ship without review because you know your own biases blind you to bugs. Agents have the same problem, but they can be their own reviewer if you build in the reflection loop.
The pattern is simple: Generate → Critique → Refine → Repeat. The magic is in making critique automatic and systematic.
The Walkthrough
Basic Reflection Loop
```python
def agent_with_reflection(task):
    # Step 1: Generate initial solution
    output = agent.generate(task)

    # Step 2: Reflect on the output
    critique = agent.critique(
        task=task,
        output=output,
        criteria=["correctness", "completeness", "edge cases"],
    )

    # Step 3: Refine based on critique
    if critique.has_issues:
        output = agent.refine(
            original_output=output,
            critique=critique,
        )

    return output
```
The Critique Prompt Pattern
The critique step is where the pattern earns its keep. Make it specific:
```python
critique_prompt = f"""
You generated this code:

{generated_code}

Review it for:
1. Syntax errors (does this even run?)
2. Logical errors (does it do what was asked?)
3. Edge cases (what breaks this?)
4. Hallucinations (are you using real APIs/functions?)

For each issue found:
- Severity: critical/major/minor
- Location: where in the code
- Suggestion: how to fix

If no issues, respond with "APPROVED"
"""
```
Multi-Pass Reflection
Different critique lenses for different passes:
| Pass | Focus | Question |
|---|---|---|
| 1. Correctness | Does it work? | "Run this mentally. Does it produce the right output?" |
| 2. Completeness | Does it handle all cases? | "What inputs would break this?" |
| 3. Quality | Is it maintainable? | "Would you accept this in code review?" |
| 4. Safety | Can it cause harm? | "What's the worst that could happen?" |
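The table above can be sketched as a loop over critique lenses. This is a minimal, hedged sketch: `critique_fn` is a stand-in for whatever agent call your framework exposes, not a real API.

```python
# The four passes from the table, as (name, lens-specific question) pairs.
PASSES = [
    ("correctness", "Run this mentally. Does it produce the right output?"),
    ("completeness", "What inputs would break this?"),
    ("quality", "Would you accept this in code review?"),
    ("safety", "What's the worst that could happen?"),
]

def multi_pass_critique(output, critique_fn):
    """Collect one critique per lens; returns {pass_name: critique_text}.
    critique_fn(output, question) is a hypothetical agent call."""
    critiques = {}
    for name, question in PASSES:
        critiques[name] = critique_fn(output, question)
    return critiques
```

Keeping each pass to a single question stops the critic from skimming: one lens at a time gets sharper answers than "review everything at once."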
Example: Code Generation with Reflection
```python
# Step 1: Generate
code = agent.generate("Write a function to fetch user data from API")

# Step 2: First reflection - correctness
critique_1 = agent.reflect(f"""
Does this code work?

{code}

Check:
- Are imports real?
- Is the API call syntax correct?
- Will this run without errors?

If no issues, respond with "APPROVED".
""")

# Step 3: Refine if needed
if "APPROVED" not in critique_1:
    code = agent.refine(code, critique_1)

# Step 4: Second reflection - edge cases
critique_2 = agent.reflect(f"""
What edge cases are missing?

{code}

Consider:
- API timeout
- Malformed response
- Network errors
- Invalid user ID

If no issues, respond with "APPROVED".
""")

# Step 5: Final refinement
if "APPROVED" not in critique_2:
    code = agent.refine(code, critique_2)
# code is now the twice-reviewed version
```
The Verification Tool Pattern
Instead of asking the agent to "imagine" if code works, give it a tool to actually test:
```python
# Hypothetical verification tools, registered as callables the agent can invoke
tools = [
    verify_syntax,        # run a linter: does it even parse?
    execute_in_sandbox,   # actually run it in isolation
    check_imports,        # verify the libraries exist
]
```
Reflection with verification is more reliable than pure critique.
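Two of those checks need nothing beyond the standard library. This is a minimal sketch, not a full linter: `ast.parse` catches syntax errors without executing anything, and `importlib.util.find_spec` confirms a top-level import actually resolves.

```python
import ast
import importlib.util

def verify_syntax(code: str):
    """Return (ok, message) without executing the code."""
    try:
        ast.parse(code)
        return True, "syntax OK"
    except SyntaxError as e:
        return False, f"syntax error on line {e.lineno}: {e.msg}"

def check_imports(code: str):
    """Return (ok, missing) for every import that doesn't resolve
    to an installed module -- i.e., a likely hallucinated library."""
    missing = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            if importlib.util.find_spec(name.split(".")[0]) is None:
                missing.append(name)
    return (not missing), missing
```

Feeding these results back into the critique prompt turns "imagine whether this runs" into "here is the linter output; now critique."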
The Self-Consistency Check
Generate multiple solutions, have agent pick the best:
```python
# Generate 3 different solutions
solutions = [
    agent.generate(task),
    agent.generate(task),
    agent.generate(task),
]

# Agent critiques all and picks the best
best = agent.select_best(f"""
Here are 3 solutions to: {task}

Solution A: {solutions[0]}
Solution B: {solutions[1]}
Solution C: {solutions[2]}

Which is best? Why? What would you improve?
""")
```
Failure Patterns
1. The Rubber-Stamp Reflection
Symptom: Agent critiques its own work and always says "looks good."
Fix: Make critique adversarial. Prompt: "You must find at least one issue, even if minor."
2. The Infinite Loop
Symptom: Agent keeps finding issues, refining, finding new issues, never finishing.
Fix: Set a max iteration limit (2-3 passes). After that, ship it.
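A bounded loop makes that fix mechanical. This is a sketch with stand-in `generate_fn`/`critique_fn`/`refine_fn` callables, using the "APPROVED" convention from the critique prompt above as the early-exit signal.

```python
MAX_PASSES = 3  # hard budget: after this, ship whatever we have

def reflect_with_budget(task, generate_fn, critique_fn, refine_fn):
    """Generate once, then critique/refine at most MAX_PASSES times.
    Exits early if the critique comes back "APPROVED"."""
    output = generate_fn(task)
    for _ in range(MAX_PASSES):
        critique = critique_fn(task, output)
        if "APPROVED" in critique:
            break
        output = refine_fn(output, critique)
    return output  # shipped even if the last critique still had issues
```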
3. The Vague Critique
Symptom: Critique says "this could be better" without specifics.
Fix: Force structured output: severity, location, exact fix needed.
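One way to enforce that structure is to parse the critique into typed records and silently drop anything vague. A minimal sketch, assuming you prompt the critic to emit one `severity|location|fix` line per issue (that delimiter format is an illustrative choice, not a standard):

```python
from dataclasses import dataclass

# Severity levels the critique prompt is asked to use.
SEVERITIES = ("critical", "major", "minor")

@dataclass
class Issue:
    severity: str   # one of SEVERITIES
    location: str   # where in the code, e.g. "line 12"
    fix: str        # the exact change requested

def parse_critique(lines):
    """Parse 'severity|location|fix' lines into Issue objects,
    skipping anything that doesn't match the format."""
    issues = []
    for line in lines:
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[0] in SEVERITIES:
            issues.append(Issue(*parts))
    return issues
```

A critique that parses to zero issues is either an approval or a vague one; either way, "this could be better" never reaches the refine step.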
4. The Same-Model Blindspot
Symptom: Agent makes mistake, then approves its own mistake in reflection.
Fix: Use a different model (or different family) for critique — e.g., Sonnet 4.6 for generation, Opus 4.6 for critique, or swap to Gemini 2.5 / GPT-5 for adversarial review.
The Confidence Paradox
Agents that are confident in their output are less likely to find issues in reflection. You may need to explicitly prompt: "Be skeptical. Assume there are bugs."
Advanced Pattern: Hierarchical Reflection
For complex tasks, reflect at multiple levels:
```python
# Level 1: Line-by-line review
for line in code.split("\n"):
    critique_line(line)

# Level 2: Function-level review
for function in extract_functions(code):
    critique_function(function)

# Level 3: Architecture review
critique_overall_design(code)
```
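For Python targets, the middle tier's `extract_functions` helper can be a few lines of stdlib `ast`. A sketch, assuming the code under review parses:

```python
import ast

def extract_functions(code: str):
    """Return (name, source) pairs for each top-level function,
    ready to feed into a function-level critique pass."""
    tree = ast.parse(code)
    return [
        (node.name, ast.get_source_segment(code, node))
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    ]
```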
Example: SQL Generation with Reflection
Without Reflection
```python
query = agent.generate("Get all users who signed up last month")
# Returns: SELECT * FROM users WHERE signup_date > '2024-01-01'
# Issues: hardcoded date, selects all columns, ignores indexes
```
With Reflection
```python
query = agent.generate("Get all users who signed up last month")

critique = agent.reflect(f"""
Review this SQL:

{query}

Check:
1. Does it actually get "last month" (not a hardcoded date)?
2. Are we selecting only the needed columns?
3. Will this be slow on a large table?
4. Any SQL injection risks?
""")

# Agent finds the issues and refines to:
# SELECT id, email, signup_date
# FROM users USE INDEX (idx_signup_date)
# WHERE signup_date >= DATE_SUB(CURDATE(), INTERVAL 1 MONTH)
#   AND signup_date < CURDATE()
```
Quick Reference
Basic Reflection Loop:
- Generate initial output
- Critique with specific criteria
- Refine based on critique
- Repeat 1-2 times max
Critique Prompt Checklist:
- ✅ Specific criteria (not "check for issues")
- ✅ Structured output format
- ✅ Adversarial framing ("find problems")
- ✅ Severity levels (critical/major/minor)
When to Use Reflection:
- Code generation (high cost of errors)
- Critical business logic (accuracy matters)
- User-facing content (quality bar)
- Complex reasoning (multi-step verification)
Rule of Thumb:
If a human would review it before shipping, your agent should too. Reflection is automated QA.