The Problem
You're working on a bug in a 50-file codebase. You want AI help, but the context window fills up after file 8. You start manually summarizing code, losing critical details. The AI hallucinates because it can't see the full picture. Or worse: you paste everything and hit token limits mid-response.
Raw context dumping doesn't scale. Manual summarization loses signal. You need semantic compression that preserves meaning while reducing tokens.
The Core Insight
LLMLingua compresses prompts by removing low-information tokens while preserving semantic structure.
Think of it like ZIP for language, but semantically aware. It doesn't just count character frequencies; it uses a language model to identify which tokens carry the most meaning. A function signature matters more than a comment restating the obvious. Import statements matter less than the actual logic.
The key: compression happens at the semantic level, not syntactic.
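To make that concrete, here's a hypothetical before/after (illustrative only; which tokens actually get dropped depends on the scoring model and the rate you request):

```python
original = (
    "# This helper checks whether the user is allowed to log in\n"
    "def can_login(user):\n"
    "    return user.is_active and not user.is_locked\n"
)

# Hypothetical compressed form: the comment restating the obvious
# collapses, while the signature and logic survive intact.
compressed = (
    "def can_login(user):\n"
    "    return user.is_active and not user.is_locked\n"
)
```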
The Walkthrough
Basic Integration Pattern
Start with the simplest integration: compress code files before sending them to the AI.
```bash
# Install LLMLingua
pip install llmlingua
```

```python
from llmlingua import PromptCompressor

# Basic compression (downloads the default model on first use)
compressor = PromptCompressor()

# Original prompt (8,000 tokens); read_files is a placeholder for
# however you load source files
code_context = read_files(['user.py', 'auth.py', 'db.py'])
question = "Why is login failing for new users?"

# Compress to 50% (4,000 tokens); rate is the fraction of tokens kept
compressed = compressor.compress_prompt(
    code_context,
    instruction=question,
    rate=0.5,
)

# Send the compressed version to the AI (ai_chat is your own chat call)
response = ai_chat(compressed['compressed_prompt'])
```
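The return value is a dict. Recent llmlingua releases also report token counts alongside the compressed text (key names follow the project README; verify them on your installed version):

```python
print(compressed['origin_tokens'], '->', compressed['compressed_tokens'])
```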
Compression Rates
Start conservative (rate=0.5 keeps 50% of the tokens). Aggressive compression (rate=0.2, an 80% reduction) works for redundant content but destroys nuanced code. Note that `rate` is the fraction of tokens retained, not removed. Tune it for your use case; a quick sweep like the one below helps.
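One low-effort way to pick a rate is to sweep a few values over a representative file and inspect the output by hand (a sketch assuming the `compressor` from above and a local `auth.py`):

```python
sample = open('auth.py').read()
for rate in (0.7, 0.5, 0.3, 0.2):
    result = compressor.compress_prompt(sample, rate=rate)
    # Eyeball the survivors: does the core logic still read?
    print(f"rate={rate}: {len(result['compressed_prompt'])} chars")
```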
Selective Compression Strategy
Not all context needs equal compression: critical code gets a light touch, boilerplate gets compressed aggressively.
```python
def compress_codebase(files, focus_file):
    """Compress with focus-aware rates."""
    compressed_context = []
    for file in files:
        if file == focus_file:
            # Light compression for the main file
            rate = 0.7  # keep 70%
        elif is_dependency(file, focus_file):
            # Medium compression for dependencies
            rate = 0.5  # keep 50%
        else:
            # Aggressive for peripheral files
            rate = 0.3  # keep 30%

        compressed = compressor.compress_prompt(
            read_file(file),
            rate=rate,
        )
        compressed_context.append({
            'file': file,
            'content': compressed['compressed_prompt'],
        })
    return compressed_context
```
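`is_dependency` and `read_file` are placeholders above. A minimal sketch of the dependency check, assuming a flat layout where a direct import in the focus file is signal enough:

```python
import re
from pathlib import Path

def is_dependency(file, focus_file):
    """Hypothetical helper: does the focus file import this module by name?
    Real dependency analysis would walk the full import graph."""
    module = re.escape(Path(file).stem)
    focus_source = Path(focus_file).read_text()
    return bool(re.search(rf"^\s*(from|import)\s+.*\b{module}\b",
                          focus_source, re.M))
```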
Preserve Critical Structures
Some code elements should never be compressed. Use targeted protection:
```python
def protect_critical_code(code):
    """Extract critical parts before compression."""
    # Extract function signatures
    signatures = extract_function_signatures(code)
    # Extract type definitions
    types = extract_type_definitions(code)
    # Extract error messages (often critical context)
    errors = extract_error_strings(code)

    # Compress everything that's left
    compressible = remove_critical_parts(code)
    compressed = compressor.compress_prompt(compressible, rate=0.4)

    # Reassemble with the critical parts intact
    return reconstruct(signatures, types, errors,
                       compressed['compressed_prompt'])
```
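The extraction helpers are left undefined; here's one hedged way to implement `extract_function_signatures` with the standard `ast` module (single-line, undecorated signatures assumed):

```python
import ast

def extract_function_signatures(code):
    """Collect 'def name(args):' lines for every function in the source."""
    sigs = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args}):")
    return sigs
```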
Real Example: Debugging Session
You're debugging an authentication bug across 31 files (45K tokens). Here's the compression pipeline:
| File Type | Original Tokens | Compression Rate | Compressed Tokens |
|---|---|---|---|
| auth.py (bug location) | 3,500 | 0.8 (keep 80%) | 2,800 |
| Dependencies (5 files) | 12,000 | 0.5 (keep 50%) | 6,000 |
| Models/Utils (10 files) | 18,000 | 0.3 (keep 30%) | 5,400 |
| Tests (15 files) | 11,500 | 0.2 (keep 20%) | 2,300 |
| Total | 45,000 | - | 16,500 |
Result: 63% reduction while keeping bug-relevant context intact.
Failure Patterns
1. Over-Compressing Critical Context
Symptom: AI gives generic answers because compressed context lost the specific details needed.
Fix: Use tiered compression. Main focus area gets 70-80% retention, not 30%.
2. Compressing Already-Terse Code
Symptom: Compressed output is unreadable. AI can't parse it.
Fix: Set minimum compression thresholds. If a file is under 500 tokens, don't compress it.
```python
def smart_compress(file_content):
    tokens = count_tokens(file_content)

    # Skip compression for small files
    if tokens < 500:
        return file_content

    # Light compression for medium files
    if tokens < 2000:
        return compressor.compress_prompt(file_content, rate=0.7)['compressed_prompt']

    # Aggressive compression for large files
    return compressor.compress_prompt(file_content, rate=0.4)['compressed_prompt']
```
3. Losing Code Structure
Symptom: Compressed code has syntax errors or broken indentation.
Fix: Use structure-aware compression that respects AST boundaries.
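A minimal sketch of that idea, assuming the `compressor` from earlier: compress each top-level function body on its own so signatures and block boundaries survive (Python 3.8+ for `end_lineno`; decorators and multi-line signatures are ignored for brevity):

```python
import ast

def compress_by_function(code, rate=0.4):
    """Compress function bodies independently; leave everything else intact."""
    try:
        tree = ast.parse(code)
    except SyntaxError:
        # Unparsable input: fall back to whole-file compression
        return compressor.compress_prompt(code, rate=rate)['compressed_prompt']
    lines = code.splitlines()
    pieces = []
    for node in tree.body:
        chunk = "\n".join(lines[node.lineno - 1:node.end_lineno])
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and "\n" in chunk:
            sig, body = chunk.split("\n", 1)  # keep the signature verbatim
            body = compressor.compress_prompt(body, rate=rate)['compressed_prompt']
            pieces.append(f"{sig}\n{body}")
        else:
            pieces.append(chunk)  # imports, classes, constants: untouched
    return "\n\n".join(pieces)
```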
4. Compressing Error Messages
Symptom: AI can't diagnose issues because error context was compressed away.
Fix: Extract and preserve error messages, stack traces, and log output before compression.
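A hedged sketch of the extraction step for Python tracebacks (extend the pattern for your own log format):

```python
import re

# Header line, indented frame lines, then the final 'ExcName: message' line
TRACEBACK_RE = re.compile(
    r"Traceback \(most recent call last\):\n(?:[ \t].*\n)*\S.*"
)

def split_errors(log_text):
    """Separate tracebacks from a log so they can bypass compression."""
    errors = TRACEBACK_RE.findall(log_text)
    remainder = TRACEBACK_RE.sub("", log_text)
    return errors, remainder
```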
When NOT to Use LLMLingua
- Code generation tasks: the AI needs full context to generate correct code
- First-time codebase exploration: compression hides patterns you haven't seen yet
- Security audits: missing a single line can be critical
- Small contexts: if you're under 50% of the token limit, skip compression
Advanced Patterns
Dynamic Compression Based on Token Budget
```python
def adaptive_compress(files, max_tokens=100000):
    """Compress just enough to fit the budget."""
    current_tokens = sum(count_tokens(f) for f in files)
    if current_tokens <= max_tokens:
        return files  # no compression needed

    # Required retention rate, with a 10% safety margin since
    # compression rarely lands exactly on target
    target_rate = (max_tokens / current_tokens) * 0.9

    return [
        compressor.compress_prompt(f, rate=target_rate)['compressed_prompt']
        for f in files
    ]
```
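Even with the safety margin, the achieved ratio is approximate, so re-count after compressing (`file_contents` stands in for your loaded sources):

```python
compressed_files = adaptive_compress(file_contents, max_tokens=100_000)
if sum(count_tokens(f) for f in compressed_files) > 100_000:
    # Still over budget: lower max_tokens and rerun, or drop
    # the lowest-priority files entirely.
    ...
```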
Iterative Compression for Complex Queries
For multi-turn conversations, compress previous context as new context arrives:
```python
class ConversationCompressor:
    def __init__(self, max_context_tokens=50000):
        self.context = []
        self.max_tokens = max_context_tokens

    def add_turn(self, user_msg, ai_response):
        self.context.append({'user': user_msg, 'ai': ai_response})
        # Compress older turns more aggressively once over budget
        if self.token_count() > self.max_tokens:
            self.compress_history()

    def token_count(self):
        return sum(count_tokens(t['user'] + t['ai']) for t in self.context)

    def compress_history(self):
        """Compress older messages more than recent ones."""
        n = len(self.context)
        for i, turn in enumerate(self.context):
            # i=0 is the oldest turn: scale retention from 0.3 (oldest)
            # up to 0.8 (newest)
            age_factor = i / max(n - 1, 1)
            rate = 0.3 + age_factor * 0.5
            turn['compressed'] = compressor.compress_prompt(
                turn['user'] + turn['ai'],
                rate=rate,
            )['compressed_prompt']
```
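Usage is one object per session (`count_tokens` is again whatever tokenizer you've standardized on):

```python
history = ConversationCompressor(max_context_tokens=50_000)
history.add_turn(
    "Why is login failing for new users?",
    "Check whether the session token is issued before the user row commits...",
)
```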
Quick Reference
Compression Rate Guidelines:
- 0.8 (80% retention): Bug location, main file being edited
- 0.6 (60% retention): Direct dependencies, imported modules
- 0.4 (40% retention): Utility files, peripheral code
- 0.2 (20% retention): Tests, docs, boilerplate
Never Compress:
- Function signatures you're modifying
- Error messages and stack traces
- Type definitions being referenced
- Files under 500 tokens
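If you load one of the LLMLingua-2 models, the library also exposes a `force_tokens` argument for exactly this kind of protection (model name and argument per the project README; treat the token list as an assumption to tune, and `log_output` as a placeholder):

```python
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-xlm-roberta-large-meetingbank",
    use_llmlingua2=True,
)
result = compressor.compress_prompt(
    log_output,  # your raw log text
    rate=0.3,
    force_tokens=['\n', 'Error', 'Traceback'],  # never drop these
)
```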
Integration Checklist:
```bash
# 1. Install
pip install llmlingua
```

```python
# 2. Basic compression
from llmlingua import PromptCompressor
compressor = PromptCompressor()

# 3. Compress with focus
compressed = compressor.compress_prompt(
    context,
    instruction=question,
    rate=0.5,
)

# 4. Validate output quality
verify_compressed_readability(compressed)
```
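`verify_compressed_readability` is a placeholder; one minimal version checks the compressor's own bookkeeping and spot-checks that key identifiers survived (dict keys follow the llmlingua README; adjust for your version):

```python
def verify_compressed_readability(result, must_keep=()):
    """Cheap sanity checks on a compress_prompt result."""
    text = result['compressed_prompt']
    assert text.strip(), "compression produced an empty prompt"
    missing = [name for name in must_keep if name not in text]
    assert not missing, f"critical identifiers lost: {missing}"
    print(result.get('origin_tokens'), '->', result.get('compressed_tokens'))

# e.g. verify_compressed_readability(compressed, must_keep=['login', 'auth'])
```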