Architecture

Deep dive into Code LoD's internal architecture and design.

Overview

Code LoD uses a modular architecture with clear separation of concerns:

┌─────────────────────────────────────────────────────────────────────┐
│                          CLI Interface                              │
│                             (cli/)                                  │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                         Core Engine                                 │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐ │
│  │ Code Parser     │  │   Hash Engine   │  │  Staleness Tracker  │ │
│  │ (parsers/)      │  │   (hashing.py)  │  │   (staleness.py)    │ │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────────┘ │
└───────────┼────────────────────┼────────────────────────────────────┘
            │                    │
            ▼                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    Storage Layer                                    │
│  ┌─────────────────┐  ┌─────────────────────┐                       │
│  │ SQLite DB       │  │  .lod Files         │                       │
│  │ (db.py)         │  │  (lod_file/)        │                       │
│  └─────────────────┘  └─────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    LLM Integration                                  │
│                      (llm/)                                         │
└─────────────────────────────────────────────────────────────────────┘

Components

CLI Layer (`cli/`)

Built with Typer, the CLI provides all user-facing commands. Each command is implemented in its own module:

Command	Handler	Description
`init`	`cli/init.py`	Creates project structure and config
`generate`	`cli/generate.py`	Parses code, generates descriptions
`status`	`cli/status.py`	Reports freshness status
`validate`	`cli/validate.py`	Checks for stale descriptions
`update`	`cli/update.py`	Regenerates stale descriptions
`read`	`cli/read.py`	Outputs descriptions for LLMs
`config`	`cli/config.py`	Get/set configuration values
`config set-model`	`cli/config.py`	Configure LLM models per scope
`install-hook`	`cli/hooks.py`	Installs git hooks
`uninstall-hook`	`cli/hooks.py`	Removes git hooks
`clean`	`cli/clean.py`	Removes all code-lod data

Parser Layer (`parsers/`)

Code parsers extract entities using tree-sitter.

BaseParser Interface

from abc import ABC, abstractmethod
from pathlib import Path


class BaseParser(ABC):
    @property
    @abstractmethod
    def language(self) -> str:
        """Return the language name this parser handles."""

    @abstractmethod
    def parse_file(self, path: Path) -> list[ParsedEntity]:
        """Parse a file and extract code entities."""

    @abstractmethod
    def parse_module(self, source: str, path: Path) -> ParsedEntity:
        """Parse a module as a whole."""

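As a rough illustration of the interface, the sketch below implements a toy parser for Python files. It uses the standard-library `ast` module as a stand-in for tree-sitter, reduces `ParsedEntity` to three fields, and the names `ToyPythonParser` and `parse_source` are hypothetical. None of this reflects the project's real parser classes.

```python
import ast
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ParsedEntity:
    """Simplified stand-in for the real ParsedEntity model."""
    scope: str
    name: str
    source: str


class BaseParser(ABC):
    @property
    @abstractmethod
    def language(self) -> str: ...

    @abstractmethod
    def parse_file(self, path: Path) -> list[ParsedEntity]: ...

    @abstractmethod
    def parse_module(self, source: str, path: Path) -> ParsedEntity: ...


class ToyPythonParser(BaseParser):
    """Hypothetical parser; uses stdlib ast rather than tree-sitter."""

    @property
    def language(self) -> str:
        return "python"

    def parse_source(self, source: str) -> list[ParsedEntity]:
        # Collect every class and function definition found in the source.
        entities = []
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                scope = "function"
            elif isinstance(node, ast.ClassDef):
                scope = "class"
            else:
                continue
            snippet = ast.get_source_segment(source, node) or ""
            entities.append(ParsedEntity(scope, node.name, snippet))
        return entities

    def parse_file(self, path: Path) -> list[ParsedEntity]:
        return self.parse_source(path.read_text(encoding="utf-8"))

    def parse_module(self, source: str, path: Path) -> ParsedEntity:
        # The whole file becomes a single module-scope entity.
        return ParsedEntity("module", path.stem, source)


parser = ToyPythonParser()
found = parser.parse_source("class A:\n    def f(self):\n        return 1\n")
names = sorted((entity.scope, entity.name) for entity in found)
```

Because callers only depend on `BaseParser`, a parser for another language could be swapped in without touching the rest of the pipeline.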
TreeSitterParser

The default parser uses tree-sitter to support 20+ languages:

Python, JavaScript, TypeScript
Go, Rust, Java, C, C++, C#
Ruby, PHP, Swift, Kotlin
And more...

Hash Engine (`hashing.py`)

AST-based hashing detects semantic changes while ignoring cosmetic differences.

Hash Computation

import hashlib

def compute_ast_hash(source: str) -> str:
    # Hash the normalized source so cosmetic edits do not change the result.
    normalized = _normalize_source(source)
    hash_obj = hashlib.sha256(normalized.encode())
    return f"sha256:{hash_obj.hexdigest()}"

Normalization

The _normalize_source() function:

Strips comments
Normalizes whitespace
Normalizes string literals
Preserves semantic structure

This ensures that formatting changes don't trigger staleness.
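To make the effect concrete, here is a minimal sketch of normalization for Python sources only. It assumes the standard-library `ast` module as the normalizer, which drops comments and whitespace and treats `'a'` and `"a"` as the same literal. The project's own `_normalize_source()` may work quite differently, and `normalize_source` here is a hypothetical stand-in for it.

```python
import ast
import hashlib


def normalize_source(source: str) -> str:
    # Dumping the parsed tree discards comments, blank lines, spacing,
    # and quote style, but keeps the program structure.
    return ast.dump(ast.parse(source), annotate_fields=False)


def compute_ast_hash(source: str) -> str:
    normalized = normalize_source(source)
    return f"sha256:{hashlib.sha256(normalized.encode()).hexdigest()}"


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):   # sum two values\n\n    return a+b\n"
changed = "def add(a, b):\n    return a - b\n"

same_after_reformat = compute_ast_hash(original) == compute_ast_hash(reformatted)
differs_after_change = compute_ast_hash(original) != compute_ast_hash(changed)
```

The reformatted variant hashes to the same value as the original, while changing `+` to `-` produces a different hash.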

Staleness Tracker (`staleness.py`)

Tracks which descriptions need updating.

class StalenessTracker:
    def check_entity(self, entity: ParsedEntity) -> StalenessStatus: ...
    def check_entities(self, entities: list[ParsedEntity]) -> FreshnessStatus: ...
    def mark_stale(self, hash_: str) -> None: ...
    def mark_fresh(self, hash_: str) -> None: ...

Revert Detection

Uses hash_history in the database to detect when code reverts to a previous version:

def check_revert(self, current_hash: str) -> tuple[bool, str | None]:
    record = self.hash_index.get(current_hash)
    if record and not record.stale:
        return True, record.description
    return False, None
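The behavior of `check_revert` can be exercised with an in-memory stand-in for the hash index. In this sketch a plain dict replaces the database, and `DescriptionRecord` and `RevertChecker` are hypothetical names introduced for illustration only.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DescriptionRecord:
    """Hypothetical row shape: a description plus its stale flag."""
    description: str
    stale: bool = False


class RevertChecker:
    def __init__(self, hash_index: dict[str, DescriptionRecord]) -> None:
        self.hash_index = hash_index

    def check_revert(self, current_hash: str) -> tuple[bool, str | None]:
        # A known, non-stale hash means the code matches an earlier version
        # whose description can be reused without calling the LLM.
        record = self.hash_index.get(current_hash)
        if record and not record.stale:
            return True, record.description
        return False, None


index = {
    "sha256:v1": DescriptionRecord("Adds two numbers."),
    "sha256:v3": DescriptionRecord("Outdated text.", stale=True),
}
checker = RevertChecker(index)

unknown = checker.check_revert("sha256:v2")    # never seen before
reverted = checker.check_revert("sha256:v1")   # code went back to v1
stale_hit = checker.check_revert("sha256:v3")  # known, but marked stale
```

Only the second call reports a revert; an unknown hash and a stale record both fall through to regeneration.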

Storage Layer

SQLite Database (`db.py`)

The HashIndex provides fast hash-to-description lookup:

CREATE TABLE descriptions (
    hash TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    stale BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    hash_history TEXT DEFAULT '[]'
);
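A minimal sketch of a lookup against this schema, using Python's built-in `sqlite3` module with an in-memory database. The `lookup` helper is hypothetical, and storing `hash_history` as a JSON list of earlier hashes is an assumption based on its `'[]'` default.

```python
import json
import sqlite3
from contextlib import closing

SCHEMA = """
CREATE TABLE IF NOT EXISTS descriptions (
    hash TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    stale BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    hash_history TEXT DEFAULT '[]'
)
"""


def lookup(conn: sqlite3.Connection, hash_: str):
    """Return (description, stale, history) for a hash, or None if absent."""
    row = conn.execute(
        "SELECT description, stale, hash_history FROM descriptions WHERE hash = ?",
        (hash_,),
    ).fetchone()
    if row is None:
        return None
    description, stale, history = row
    return description, bool(stale), json.loads(history)


# closing() releases the connection; the inner "with conn" commits on
# success and rolls back if the insert raises.
with closing(sqlite3.connect(":memory:")) as conn:
    conn.execute(SCHEMA)
    with conn:
        conn.execute(
            "INSERT INTO descriptions (hash, description, hash_history) VALUES (?, ?, ?)",
            ("sha256:v2", "Adds two numbers.", json.dumps(["sha256:v1"])),
        )
    hit = lookup(conn, "sha256:v2")
    miss = lookup(conn, "sha256:unknown")
```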

.lod Files (`lod_file/`)

Structured comment files stored under .code-lod/.lod/, mirroring the source tree:

# @lod hash:sha256:a1b2c3d4... stale:false
# @lod description: Provides user authentication functionality.

def authenticate_user(username: str, password: str) -> str:
    ...
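The two `@lod` comment lines shown above could be read with a pair of regular expressions. This is only a sketch based on that one example: the real `comment_parser.py` may accept a richer grammar, `parse_lod_comments` is a hypothetical name, and the hash in the sample is a shortened placeholder.

```python
import re

HEADER_RE = re.compile(
    r"^#\s*@lod\s+hash:(?P<hash>\S+)\s+stale:(?P<stale>true|false)\s*$"
)
DESCRIPTION_RE = re.compile(r"^#\s*@lod\s+description:\s*(?P<text>.*)$")


def parse_lod_comments(text: str) -> list[dict]:
    """Collect one entry per '@lod hash:' line, with its description."""
    entries = []
    current = None
    for line in text.splitlines():
        header = HEADER_RE.match(line)
        if header:
            current = {
                "hash": header["hash"],
                "stale": header["stale"] == "true",
                "description": "",
            }
            entries.append(current)
            continue
        description = DESCRIPTION_RE.match(line)
        if description and current is not None:
            current["description"] = description["text"].strip()
    return entries


sample = (
    "# @lod hash:sha256:a1b2c3d4 stale:false\n"
    "# @lod description: Provides user authentication functionality.\n"
    "\n"
    "def authenticate_user(username: str, password: str) -> str:\n"
    "    ...\n"
)
entries = parse_lod_comments(sample)
```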

Why dual storage?

SQLite: Fast lookups, caching, revert detection
.lod files: Human-readable, version-controlled, LLM-consumable

LLM Integration (`llm/description_generator/`)

Abstract interface for description generation with multiple provider implementations:

class Provider(str, Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    OLLAMA = "ollama"
    MOCK = "mock"

def get_generator(
    provider: Provider | None = None,
    model: str | None = None,
) -> DescriptionGenerator:
    """Get a description generator instance."""

Provider Implementations

OpenAI (openai.py): GPT-4, GPT-4o, GPT-3.5-turbo via openai package

Anthropic (anthropic.py): Claude Sonnet, Claude Haiku, Claude Opus via anthropic package

Ollama (ollama.py): Local models (codellama, mistral, llama2, etc.) via ollama package

Mock (mock.py): Placeholder descriptions for testing (no API key required)

Auto-Detection

The get_generator() function auto-detects the provider from environment variables:

1. Checks ANTHROPIC_API_KEY → uses Anthropic
2. Checks OPENAI_API_KEY → uses OpenAI
3. Falls back to the Mock generator
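The detection order can be sketched as a small function. `detect_provider` is a hypothetical helper, not part of the documented API, and treating an empty variable as unset is an assumption.

```python
import os
from enum import Enum
from typing import Mapping, Optional


class Provider(str, Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    OLLAMA = "ollama"
    MOCK = "mock"


def detect_provider(env: Optional[Mapping[str, str]] = None) -> Provider:
    """Pick a provider from API keys, checking Anthropic before OpenAI."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):  # assumption: empty string counts as unset
        return Provider.ANTHROPIC
    if env.get("OPENAI_API_KEY"):
        return Provider.OPENAI
    return Provider.MOCK


both_keys = detect_provider({"ANTHROPIC_API_KEY": "a", "OPENAI_API_KEY": "b"})
openai_only = detect_provider({"OPENAI_API_KEY": "b"})
no_keys = detect_provider({})
```

Passing the environment as a mapping keeps the function easy to test without modifying `os.environ`.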

Scope-Specific Models

Configure different models for different hierarchical scopes:

code-lod config set-model --scope function --provider openai --model gpt-4o
code-lod config set-model --scope project --provider anthropic --model claude-sonnet

Base Generator Interface

class DescriptionGenerator(ABC):
    @abstractmethod
    def generate(self, entity: ParsedEntity, context: str | None = None) -> str:
        """Generate a description for a code entity."""

    @abstractmethod
    def generate_batch(self, entities: list[ParsedEntity], context: str | None = None) -> list[str]:
        """Generate descriptions for multiple entities."""

The BaseLLMDescriptionGenerator provides:

- Prompt templates for function, class, and module scopes
- Source truncation for large code blocks
- Automatic fallback to mock on API errors
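Truncation and fallback might look roughly like the sketch below. The class names, the 4000-character limit, and the truncation marker are all assumptions made for illustration. A real implementation would also catch provider-specific error types rather than a bare `Exception`.

```python
def truncate_source(source: str, max_chars: int = 4000) -> str:
    """Cut oversized sources and leave a visible marker."""
    if len(source) <= max_chars:
        return source
    return source[:max_chars] + "\n# ... (truncated)"


class MockGenerator:
    def generate(self, source: str) -> str:
        return "Placeholder description."


class FailingGenerator:
    """Simulates a provider whose API call raises."""

    def generate(self, source: str) -> str:
        raise RuntimeError("simulated API error")


class FallbackGenerator:
    """Hypothetical wrapper: try the primary provider, then the mock."""

    def __init__(self, primary, max_chars: int = 4000) -> None:
        self.primary = primary
        self.fallback = MockGenerator()
        self.max_chars = max_chars

    def generate(self, source: str) -> str:
        prompt_source = truncate_source(source, self.max_chars)
        try:
            return self.primary.generate(prompt_source)
        except Exception:  # broad on purpose in this sketch
            return self.fallback.generate(prompt_source)


description = FallbackGenerator(FailingGenerator()).generate("def f(): pass")
shortened = truncate_source("x" * 10, max_chars=4)
```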

Data Models

Scope Hierarchy

class Scope(str, Enum):
    PROJECT = "project"      # Entire codebase
    PACKAGE = "package"      # Directory/module group
    MODULE = "module"        # Single file
    CLASS = "class"          # Class definition
    FUNCTION = "function"    # Function/method

ParsedEntity

@dataclass
class ParsedEntity:
    scope: Scope
    name: str
    location: CodeLocation
    source: str
    ast_hash: str
    language: str
    parent_name: str | None = None
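For illustration, a method inside a class might be represented as follows. The fields of `CodeLocation` (`path`, `start_line`, `end_line`) are assumptions, since only its name and frozen status appear in this document, and the hash value is a placeholder.

```python
from __future__ import annotations

import dataclasses
from dataclasses import dataclass
from enum import Enum


class Scope(str, Enum):
    PROJECT = "project"
    PACKAGE = "package"
    MODULE = "module"
    CLASS = "class"
    FUNCTION = "function"


@dataclass(frozen=True)
class CodeLocation:
    # Field names are assumptions for this sketch.
    path: str
    start_line: int
    end_line: int


@dataclass
class ParsedEntity:
    scope: Scope
    name: str
    location: CodeLocation
    source: str
    ast_hash: str
    language: str
    parent_name: str | None = None


method = ParsedEntity(
    scope=Scope.FUNCTION,
    name="login",
    location=CodeLocation("src/auth.py", 10, 24),
    source="def login(self): ...",
    ast_hash="sha256:placeholder",
    language="python",
    parent_name="AuthService",  # links the method to its enclosing class
)

# Frozen dataclasses reject attribute assignment after construction.
try:
    method.location.start_line = 1
except dataclasses.FrozenInstanceError:
    location_is_immutable = True
else:
    location_is_immutable = False
```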

File Structure

src/code_lod/
├── __init__.py
├── __main__.py         # Entry point
├── cli/                # CLI commands (one per file)
│   ├── __init__.py     # Main app, command registration
│   ├── clean.py        # Clean command
│   ├── config.py       # Config and set-model commands
│   ├── generate.py     # Generate command
│   ├── hooks.py        # install-hook, uninstall-hook
│   ├── init.py         # Init command
│   ├── read.py         # Read command
│   ├── status.py       # Status command
│   ├── update.py       # Update command
│   └── validate.py     # Validate command
├── config.py           # Configuration and paths management
├── db.py               # SQLite database layer
├── hashing.py          # AST hash computation
├── models.py           # Pydantic data models
├── staleness.py        # Staleness tracking
├── llm/                # LLM integration
│   ├── __init__.py
│   └── description_generator/
│       ├── generator.py    # Base classes, Provider enum, get_generator()
│       ├── anthropic.py    # Anthropic Claude provider
│       ├── openai.py       # OpenAI provider
│       ├── ollama.py       # Ollama local models provider
│       └── mock.py         # Mock generator for testing
├── parsers/            # Code parsers
│   ├── __init__.py
│   ├── base.py         # BaseParser interface
│   └── tree_sitter_parser.py
└── lod_file/           # .lod file management
    ├── __init__.py
    ├── comment_parser.py   # Parse @lod comments
    ├── reader.py           # Read .lod files
    └── writer.py           # Write .lod files

Directory Layout

After running code-lod init:

your-project/
├── .code-lod/
│   ├── config.json          # Project configuration
│   ├── hash-index.db        # SQLite database (not version controlled)
│   └── .lod/                # Description files (version controlled)
│       └── src/
│           └── module.py.lod
└── src/
    └── module.py
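Going by the layout above, the description file for a source file can be located by mirroring its relative path under `.code-lod/.lod/` and appending `.lod`. The helper below is a hypothetical sketch of that mapping, not the project's actual path logic.

```python
from pathlib import Path


def lod_path_for(project_root: Path, source_file: Path) -> Path:
    """Map a source file to its mirrored .lod file under .code-lod/.lod/."""
    relative = source_file.relative_to(project_root)
    lod_name = relative.name + ".lod"  # module.py -> module.py.lod
    return project_root / ".code-lod" / ".lod" / relative.with_name(lod_name)


root = Path("your-project")
target = lod_path_for(root, root / "src" / "module.py")
```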

Design Principles

Plugin Architecture: Parsers and generators use abstract base classes for extensibility
Hash-Based Change Detection: Semantic changes trigger updates, formatting doesn't
Dual Storage: SQLite for performance, .lod files for portability
Frozen Dataclasses: Immutable data where possible (CodeLocation)
Context Managers: Safe database connection handling
Structured Logging: All operations logged via structlog

Performance Considerations

Hash lookups: O(log n) via the SQLite primary-key index (a B-tree)
Tree-sitter parsing: Fast, incremental parsing
Lazy generation: Only regenerates stale descriptions
Caching: Database serves as description cache