Building a Codebase Knowledge Base

Module 12: RAG & Vector Databases | Expansion Guide

The Problem

Your codebase has 50,000 files. You want an AI agent to understand it. You naively chunk every file, embed everything, and query it. The results are terrible: it retrieves config.json when you ask about authentication, misses the actual auth module, and returns test fixtures as production examples.

Not all code is equally important. Treating it equally makes retrieval useless.

Most code RAG systems fail because they don't understand code semantics. They treat code like prose - it's not. Code has structure, dependencies, and hierarchies that pure text embeddings miss.

The Core Insight

Code RAG needs structural awareness, not just semantic similarity.

Think of a codebase like a graph: functions call each other, modules import dependencies, tests reference implementations. Flat text embeddings lose this structure. Production code RAG preserves it.

The winning approach: combine text embeddings with code structure metadata.

The Walkthrough

Architecture Overview

Codebase Knowledge Base
├─ File Prioritization (what to index)
├─ Code Chunking (how to split)
├─ Metadata Extraction (structure + context)
├─ Dependency Graph (relationships)
├─ Embedding + Indexing (vector DB)
└─ Incremental Updates (stay fresh)

Step 1: File Prioritization

Don't index everything. Prioritize by importance:

Priority | File Types                                 | Why
---------|--------------------------------------------|-------------------------------
High     | Core business logic, API routes, services  | Where actual work happens
Medium   | Utils, helpers, models, schemas            | Reusable components
Low      | Tests (index selectively), docs            | Reference, not implementation
Skip     | node_modules, build artifacts, configs     | Noise, not signal

def should_index_file(file_path: str) -> bool:
    """Decide if file should be indexed."""
    # Skip dependencies and build artifacts
    skip_patterns = ['node_modules', 'venv', '.git', 'dist', 'build']
    if any(pattern in file_path for pattern in skip_patterns):
        return False

    # Skip config and data files
    if file_path.endswith(('.json', '.yml', '.yaml', '.env')):
        return False

    # Index source code
    code_extensions = ['.js', '.ts', '.py', '.java', '.go', '.rs']
    return any(file_path.endswith(ext) for ext in code_extensions)
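The pipeline later relies on a find_source_files helper; here is a minimal sketch of that walk, assuming the same skip patterns as above (the compact should_index_file in this sketch is a self-contained variant of the function above):

```python
import os

SKIP_PATTERNS = ('node_modules', 'venv', '.git', 'dist', 'build')
CODE_EXTENSIONS = ('.js', '.ts', '.py', '.java', '.go', '.rs')

def should_index_file(file_path: str) -> bool:
    """Compact variant of the filter above, matching whole path components."""
    if any(part in SKIP_PATTERNS for part in file_path.split(os.sep)):
        return False
    return file_path.endswith(CODE_EXTENSIONS)

def find_indexable_files(repo_path: str) -> list[str]:
    """Walk the repo, pruning skipped directories so os.walk never enters them."""
    indexable = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d not in SKIP_PATTERNS]  # prune in place
        for name in files:
            path = os.path.join(root, name)
            if should_index_file(path):
                indexable.append(path)
    return indexable
```

Matching whole path components rather than substrings avoids accidentally skipping files like rebuild.py.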

Step 2: Semantic Code Chunking

Chunk by logical units (functions, classes), not lines:

from tree_sitter_languages import get_parser

def chunk_code_semantically(code: str, language: str) -> list[dict]:
    """
    Parse code into semantic chunks (functions, classes, methods).
    """
    # get_parser loads a Parser preconfigured with a prebuilt grammar
    # from the tree-sitter-languages package
    parser = get_parser(language)
    tree = parser.parse(bytes(code, 'utf8'))

    chunks = []

    def extract_chunks(node, parent_context=""):
        # Node type names vary by grammar (these cover Python and JS/TS)
        if node.type in ['function_definition', 'class_definition', 'method_definition']:
            # Extract the full code for this unit
            chunk_code = code[node.start_byte:node.end_byte]

            # Extract metadata
            name = extract_name(node, code)
            docstring = extract_docstring(node, code)
            params = extract_parameters(node, code)

            chunks.append({
                'type': node.type,
                'name': name,
                'code': chunk_code,
                'docstring': docstring,
                'parameters': params,
                'parent_context': parent_context,
                'start_line': node.start_point[0],
                'end_line': node.end_point[0]
            })

            # Recurse for nested definitions
            for child in node.children:
                extract_chunks(child, parent_context=name)
        else:
            for child in node.children:
                extract_chunks(child, parent_context)

    extract_chunks(tree.root_node)
    return chunks

Step 3: Rich Metadata Extraction

Add context beyond the code itself:

def extract_metadata(file_path: str, chunk: dict) -> dict:
    """
    Add rich metadata for better retrieval.
    """
    return {
        **chunk,
        'metadata': {
            'file_path': file_path,
            'module': extract_module_path(file_path),
            'imports': extract_imports(chunk['code']),
            'calls_to': extract_function_calls(chunk['code']),
            'complexity': calculate_complexity(chunk['code']),
            'has_tests': check_for_tests(chunk['name'], file_path),
            'last_modified': get_git_last_modified(file_path),
            'primary_author': get_git_primary_author(file_path)
        }
    }
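Several helpers above are left undefined; for Python chunks, extract_imports and extract_function_calls can be sketched with the stdlib ast module (textwrap.dedent handles indented method bodies):

```python
import ast
import textwrap

def extract_imports(code: str) -> list[str]:
    """Collect imported module names from a Python chunk."""
    tree = ast.parse(textwrap.dedent(code))
    modules = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            modules.append(node.module)
    return modules

def extract_function_calls(code: str) -> list[str]:
    """Collect names of functions/methods called inside a Python chunk."""
    tree = ast.parse(textwrap.dedent(code))
    calls = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name):
                calls.append(node.func.id)      # plain call: foo(...)
            elif isinstance(node.func, ast.Attribute):
                calls.append(node.func.attr)    # method call: obj.foo(...)
    return calls
```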

Step 4: Dependency Graph Integration

Track relationships between code units:

import networkx as nx

class CodebaseGraph:
    """
    Graph of code dependencies for enhanced retrieval.
    """
    def __init__(self):
        self.graph = nx.DiGraph()

    def add_function(self, func_name: str, metadata: dict):
        """Add function node with metadata."""
        self.graph.add_node(func_name, **metadata)

    def add_dependency(self, caller: str, callee: str):
        """Add edge from caller to callee."""
        self.graph.add_edge(caller, callee)

    def get_related_functions(self, func_name: str, depth: int = 2) -> list[str]:
        """
        Get functions related to this one (callers + callees).
        """
        # Functions this one calls, within `depth` hops
        # (nx.descendants has no depth limit, so use a cutoff BFS)
        callees = nx.single_source_shortest_path_length(
            self.graph, func_name, cutoff=depth)

        # Functions that call this one: same BFS over reversed edges
        callers = nx.single_source_shortest_path_length(
            self.graph.reverse(copy=False), func_name, cutoff=depth)

        related = set(callees) | set(callers)
        related.discard(func_name)
        return list(related)
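The same caller/callee expansion can be sketched without networkx, using plain dicts and a breadth-first walk. One deliberate difference: this variant follows edges in both directions at every hop, so at depth 2 it also surfaces sibling functions that share a callee:

```python
from collections import defaultdict

class SimpleCallGraph:
    """Minimal caller/callee graph with breadth-first expansion (no networkx)."""
    def __init__(self):
        self.callees = defaultdict(set)  # caller -> functions it calls
        self.callers = defaultdict(set)  # callee -> functions that call it

    def add_dependency(self, caller: str, callee: str):
        self.callees[caller].add(callee)
        self.callers[callee].add(caller)

    def related(self, func: str, depth: int = 2) -> set[str]:
        """All functions within `depth` hops, following calls in either direction."""
        found, frontier = set(), {func}
        for _ in range(depth):
            nxt = set()
            for f in frontier:
                nxt |= self.callees[f] | self.callers[f]
            nxt -= found | {func}
            found |= nxt
            frontier = nxt
        return found
```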

Step 5: Hybrid Retrieval Strategy

Combine vector search with graph traversal:

def retrieve_relevant_code(query: str, top_k: int = 5) -> list[dict]:
    """
    Hybrid retrieval: semantic + structural.
    """
    # Step 1: Vector search for semantically similar chunks
    vector_results = vector_db.search(query, top_k=top_k*2)

    # Step 2: Expand with dependency graph
    expanded_results = []
    for result in vector_results:
        func_name = result['metadata']['name']

        # Add the matched function
        expanded_results.append(result)

        # Add related functions from graph
        related = codebase_graph.get_related_functions(func_name, depth=1)
        for rel_func in related[:2]:  # Top 2 related
            related_chunk = vector_db.get_by_name(rel_func)
            if related_chunk:
                expanded_results.append(related_chunk)

    # Step 3: Re-rank by relevance + importance
    ranked = rerank_results(expanded_results, query)

    return ranked[:top_k]

Why Hybrid Retrieval Works

Vector search finds: "What code is semantically similar to the query?"
Graph expansion adds: "What other code is structurally related?"
Together: You get both the answer and its context.
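rerank_results is called above without a definition. One plausible sketch blends the vector similarity score with a structural importance signal; the score and importance fields here are illustrative assumptions, not an API the guide defines:

```python
def rerank_results(results: list[dict], query: str, alpha: float = 0.7) -> list[dict]:
    """
    Blend semantic similarity with structural importance (illustrative sketch).
    Assumes each result has a 'score' (vector similarity in [0, 1]) and an
    'importance' metadata field (e.g. normalized call-graph in-degree).
    """
    # Deduplicate by function name, keeping the first occurrence
    seen, deduped = set(), []
    for r in results:
        name = r['metadata']['name']
        if name not in seen:
            seen.add(name)
            deduped.append(r)

    # `query` is unused here; a fuller version could boost exact-name matches
    return sorted(
        deduped,
        key=lambda r: alpha * r['score'] + (1 - alpha) * r['metadata'].get('importance', 0.0),
        reverse=True,
    )
```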

Step 6: Incremental Updates

Re-indexing 50k files on every commit is wasteful. Update smartly:

def incremental_update(changed_files: list[str]):
    """
    Update only changed files and their dependents.
    """
    for file_path in changed_files:
        # Remove old chunks for this file
        vector_db.delete_where(metadata={'file_path': file_path})

        # Re-chunk and re-index
        code = read_file(file_path)
        chunks = chunk_code_semantically(code, detect_language(file_path))

        for chunk in chunks:
            metadata = extract_metadata(file_path, chunk)
            embedding = embed_code(chunk['code'])
            vector_db.add(embedding, metadata)

        # Update dependency graph
        update_graph_for_file(file_path, chunks)
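The change feed itself is left abstract above. A dependency-free way to gather candidates for incremental_update is a modification-time sweep - a sketch only; in production, the git hooks or file watchers the guide recommends are the right tool:

```python
import os

def find_changed_files(repo_path: str, since_timestamp: float,
                       extensions=('.py', '.js', '.ts')) -> list[str]:
    """
    Cheap change detection by file modification time.
    A real system would use git hooks or a file watcher instead.
    """
    changed = []
    for root, dirs, files in os.walk(repo_path):
        dirs[:] = [d for d in dirs if d not in ('.git', 'node_modules')]
        for name in files:
            path = os.path.join(root, name)
            if name.endswith(extensions) and os.path.getmtime(path) > since_timestamp:
                changed.append(path)
    return changed
```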

Production-Ready Pipeline

from tqdm import tqdm

class CodebaseKnowledgeBase:
    """
    Production code RAG system.
    """
    def __init__(self, repo_path: str):
        self.repo_path = repo_path
        self.vector_db = init_vector_db()
        self.graph = CodebaseGraph()

    def index_codebase(self):
        """Initial indexing of entire codebase."""
        files = find_source_files(self.repo_path)

        for file_path in tqdm(files):
            if not should_index_file(file_path):
                continue

            self._index_file(file_path)

    def _index_file(self, file_path: str):
        """Index a single file."""
        code = read_file(file_path)
        language = detect_language(file_path)
        chunks = chunk_code_semantically(code, language)

        for chunk in chunks:
            # Enrich with metadata
            enriched = extract_metadata(file_path, chunk)

            # Embed and store
            embedding = embed_code(chunk['code'] + (chunk['docstring'] or ''))
            self.vector_db.add(embedding, enriched)

            # Update graph
            self.graph.add_function(chunk['name'], enriched)

            # Add dependencies (calls_to lives on the enriched metadata)
            for callee in enriched['metadata']['calls_to']:
                self.graph.add_dependency(chunk['name'], callee)

    def query(self, question: str, top_k: int = 5) -> list[dict]:
        """Query the knowledge base."""
        return retrieve_relevant_code(question, top_k)

    def watch_for_changes(self):
        """Watch repo for changes and incrementally update."""
        # Use git hooks or file watcher
        for changed_file in watch_git_changes(self.repo_path):
            incremental_update([changed_file])

Failure Patterns

1. The Kitchen Sink Index

Symptom: You indexed node_modules, configs, 10k test fixtures - retrieval is garbage.

Fix: Be selective. Index production code, skip noise.

2. The Line-Count Chunker

Symptom: Functions split mid-implementation, context lost.

Fix: Use AST-based chunking. Respect code structure.

3. The Flat Embedding

Symptom: Retrieves similar code but misses dependencies.

Fix: Build dependency graph. Expand results structurally.

4. The Stale Index

Symptom: Agent suggests code from 50 commits ago.

Fix: Incremental updates on file changes. Git hooks or watchers.

The Embedding Model Matters

Code-specific embeddings (CodeBERT, StarEncoder) outperform general embeddings (OpenAI text-embedding-3) by 30%+ on code retrieval tasks. If code RAG is your core use case, use specialized models.

Quick Reference

Code RAG Pipeline:

  1. Filter files (skip deps, configs, build artifacts)
  2. Chunk by structure (AST parsing, not lines)
  3. Extract metadata (imports, calls, complexity)
  4. Build dependency graph (caller/callee relationships)
  5. Embed with code-specific model
  6. Hybrid retrieval (vector + graph)
  7. Incremental updates (re-index only changed files)

Key Metadata to Track:

  1. File path and module
  2. Imports and outgoing calls (feeds the dependency graph)
  3. Complexity and whether tests exist
  4. Git context: last modified, primary author

Retrieval Strategy:

  1. Vector search for semantic matches
  2. Graph expansion for related code
  3. Re-rank by relevance + importance
  4. Return top-k with full context

Rule of Thumb:

Code isn't just text - it's a graph. Pure text embeddings miss structure. Combine semantic search with dependency graphs for production-grade code retrieval.