Architecture
Deep dive into Code LoD's internal architecture and design.
Overview
Code LoD uses a modular architecture with clear separation of concerns:
┌─────────────────────────────────────────────────────────────────────┐
│                            CLI Interface                            │
│                               (cli/)                                │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                             Core Engine                             │
│  ┌─────────────────┐  ┌─────────────────┐  ┌─────────────────────┐  │
│  │   Code Parser   │  │   Hash Engine   │  │  Staleness Tracker  │  │
│  │   (parsers/)    │  │  (hashing.py)   │  │   (staleness.py)    │  │
│  └────────┬────────┘  └────────┬────────┘  └─────────────────────┘  │
└───────────┼────────────────────┼────────────────────────────────────┘
            │                    │
            ▼                    ▼
┌─────────────────────────────────────────────────────────────────────┐
│                            Storage Layer                            │
│  ┌─────────────────┐  ┌─────────────────────┐                       │
│  │    SQLite DB    │  │     .lod Files      │                       │
│  │     (db.py)     │  │     (lod_file/)     │                       │
│  └─────────────────┘  └─────────────────────┘                       │
└─────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────┐
│                           LLM Integration                           │
│                               (llm/)                                │
└─────────────────────────────────────────────────────────────────────┘
Components
CLI Layer (cli/)
Built with Typer, the CLI provides all user-facing commands. Each command is implemented in its own module:
| Command | Handler | Description |
|---|---|---|
| init | cli/init.py | Creates project structure and config |
| generate | cli/generate.py | Parses code, generates descriptions |
| status | cli/status.py | Reports freshness status |
| validate | cli/validate.py | Checks for stale descriptions |
| update | cli/update.py | Regenerates stale descriptions |
| read | cli/read.py | Outputs descriptions for LLMs |
| config | cli/config.py | Get/set configuration values |
| config set-model | cli/config.py | Configure LLM models per scope |
| install-hook | cli/hooks.py | Installs git hooks |
| uninstall-hook | cli/hooks.py | Removes git hooks |
| clean | cli/clean.py | Removes all code-lod data |
Parser Layer (parsers/)
Code parsers extract entities using tree-sitter.
BaseParser Interface
from abc import ABC, abstractmethod
from pathlib import Path
class BaseParser(ABC):
    @property
    @abstractmethod
    def language(self) -> str:
        """Return the language name this parser handles."""
    @abstractmethod
    def parse_file(self, path: Path) -> list[ParsedEntity]:
        """Parse a file and extract code entities."""
    @abstractmethod
    def parse_module(self, source: str, path: Path) -> ParsedEntity:
        """Parse a module as a whole."""
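To illustrate how a concrete parser might plug into this interface, here is a minimal sketch. It is not the project's tree-sitter parser: `StdlibPythonParser` is a hypothetical class built on Python's standard `ast` module, `ParsedEntity` is a simplified stand-in with fewer fields than the real model, and the hash is a plain text hash used only as a placeholder.

```python
import ast
import hashlib
import tempfile
from abc import ABC, abstractmethod
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class ParsedEntity:
    """Simplified stand-in for the real ParsedEntity (fewer fields)."""
    scope: str
    name: str
    source: str
    ast_hash: str
    language: str


class BaseParser(ABC):
    @property
    @abstractmethod
    def language(self) -> str: ...

    @abstractmethod
    def parse_file(self, path: Path) -> list[ParsedEntity]: ...

    @abstractmethod
    def parse_module(self, source: str, path: Path) -> ParsedEntity: ...


class StdlibPythonParser(BaseParser):
    """Hypothetical parser using the stdlib ast module instead of tree-sitter."""

    @property
    def language(self) -> str:
        return "python"

    def _entity(self, scope: str, name: str, source: str) -> ParsedEntity:
        # Placeholder: hashes the raw text, not a normalized AST.
        digest = hashlib.sha256(source.encode()).hexdigest()
        return ParsedEntity(scope, name, source, f"sha256:{digest}", self.language)

    def parse_module(self, source: str, path: Path) -> ParsedEntity:
        return self._entity("module", path.stem, source)

    def parse_file(self, path: Path) -> list[ParsedEntity]:
        source = path.read_text(encoding="utf-8")
        entities = [self.parse_module(source, path)]
        # ast.walk visits nodes breadth-first, so outer definitions come first.
        for node in ast.walk(ast.parse(source)):
            if isinstance(node, ast.ClassDef):
                scope = "class"
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                scope = "function"
            else:
                continue
            segment = ast.get_source_segment(source, node) or ""
            entities.append(self._entity(scope, node.name, segment))
        return entities


# Parse a small temporary file and list the (scope, name) pairs found.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.py"
    sample.write_text(
        "class Greeter:\n    def hello(self):\n        return 'hi'\n",
        encoding="utf-8",
    )
    found = [(e.scope, e.name) for e in StdlibPythonParser().parse_file(sample)]
```

The point of the abstract base class is that the rest of the pipeline only depends on `language`, `parse_file`, and `parse_module`, so a parser like this could in principle be swapped in without touching other components.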
TreeSitterParser
The default parser uses tree-sitter to support 20+ languages:
- Python, JavaScript, TypeScript
- Go, Rust, Java, C, C++, C#
- Ruby, PHP, Swift, Kotlin
- And more...
Hash Engine (hashing.py)
AST-based hashing detects semantic changes while ignoring cosmetic differences.
Hash Computation
import hashlib
def compute_ast_hash(source: str) -> str:
    # Hash the normalized form so cosmetic edits leave the hash unchanged.
    normalized = _normalize_source(source)
    hash_obj = hashlib.sha256(normalized.encode())
    return f"sha256:{hash_obj.hexdigest()}"
Normalization
The _normalize_source() function:
- Strips comments
- Normalizes whitespace
- Normalizes string literals
- Preserves semantic structure
This ensures that formatting changes don't trigger staleness.
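The effect of normalization can be sketched for Python sources as follows. This is an assumption-laden illustration, not the project's `_normalize_source()`: it swaps in the standard `ast` module (parse, then dump without position attributes), which drops comments, whitespace, and quote style for Python only. Docstrings remain part of the tree in this sketch.

```python
import ast
import hashlib


def _normalize_source(source: str) -> str:
    # Illustrative stand-in: dump the AST without line/column attributes,
    # so comments, spacing, and quote style do not affect the result.
    tree = ast.parse(source)
    return ast.dump(tree, annotate_fields=False, include_attributes=False)


def compute_ast_hash(source: str) -> str:
    normalized = _normalize_source(source)
    return f"sha256:{hashlib.sha256(normalized.encode()).hexdigest()}"


original = "def add(a, b):\n    return a + b\n"
reformatted = "def add(a,b):   # sum two values\n\n    return a+b\n"
changed = "def add(a, b):\n    return a - b\n"

same_after_reformat = compute_ast_hash(original) == compute_ast_hash(reformatted)
differs_after_change = compute_ast_hash(original) != compute_ast_hash(changed)
```

Here `same_after_reformat` and `differs_after_change` are both true: the reformatted source hashes identically to the original, while replacing `+` with `-` produces a different hash.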
Staleness Tracker (staleness.py)
Tracks which descriptions need updating.
class StalenessTracker:
    def check_entity(self, entity: ParsedEntity) -> StalenessStatus: ...
    def check_entities(self, entities: list[ParsedEntity]) -> FreshnessStatus: ...
    def mark_stale(self, hash_: str) -> None: ...
    def mark_fresh(self, hash_: str) -> None: ...
Revert Detection
Uses hash_history in the database to detect when code reverts to a previous version:
def check_revert(self, current_hash: str) -> tuple[bool, str | None]:
    record = self.hash_index.get(current_hash)
    if record and not record.stale:
        return True, record.description
    return False, None
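The lookup above can be exercised in isolation with an in-memory index. The sketch below is hypothetical: `DescriptionRecord` and `RevertChecker` are illustrative names, and a plain dict stands in for the SQLite-backed hash index.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class DescriptionRecord:
    """Hypothetical record shape: one stored description per hash."""
    description: str
    stale: bool = False


class RevertChecker:
    def __init__(self) -> None:
        # In-memory stand-in for the database-backed hash index.
        self.hash_index: dict[str, DescriptionRecord] = {}

    def check_revert(self, current_hash: str) -> tuple[bool, str | None]:
        record = self.hash_index.get(current_hash)
        if record and not record.stale:
            # The code matches a previously described version: reuse its text.
            return True, record.description
        return False, None


checker = RevertChecker()
checker.hash_index["sha256:v1"] = DescriptionRecord("Adds two numbers.")
checker.hash_index["sha256:v2"] = DescriptionRecord("Outdated text.", stale=True)

reverted = checker.check_revert("sha256:v1")
stale_hit = checker.check_revert("sha256:v2")
unknown = checker.check_revert("sha256:v3")
```

When the current hash matches a fresh record, the stored description can be reused without calling an LLM; stale or unknown hashes fall through to regeneration.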
Storage Layer
SQLite Database (db.py)
The HashIndex provides fast hash-to-description lookup:
CREATE TABLE descriptions (
    hash TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    stale BOOLEAN DEFAULT FALSE,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    hash_history TEXT DEFAULT '[]'
);
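A minimal sketch of working with this table through Python's built-in `sqlite3` module is shown below. The helper names `store` and `lookup` are hypothetical, and the way `hash_history` is populated (a JSON list of earlier hashes) is an assumption based on its `'[]'` default, not the documented behaviour of `db.py`.

```python
import json
import sqlite3

# Same columns as the schema above; 0 is SQLite's stored form of FALSE.
SCHEMA = """
CREATE TABLE IF NOT EXISTS descriptions (
    hash TEXT PRIMARY KEY,
    description TEXT NOT NULL,
    stale BOOLEAN DEFAULT 0,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    hash_history TEXT DEFAULT '[]'
);
"""


def store(conn, hash_, description, previous_hash=None):
    """Insert a description; assumed: history is a JSON list of prior hashes."""
    history = []
    if previous_hash is not None:
        row = conn.execute(
            "SELECT hash_history FROM descriptions WHERE hash = ?",
            (previous_hash,),
        ).fetchone()
        history = (json.loads(row[0]) if row else []) + [previous_hash]
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO descriptions (hash, description, hash_history) "
            "VALUES (?, ?, ?)",
            (hash_, description, json.dumps(history)),
        )


def lookup(conn, hash_):
    """Return the fresh description for a hash, or None."""
    row = conn.execute(
        "SELECT description FROM descriptions WHERE hash = ? AND stale = 0",
        (hash_,),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
store(conn, "sha256:v1", "First version.")
store(conn, "sha256:v2", "Second version.", previous_hash="sha256:v1")
v2_history = json.loads(
    conn.execute(
        "SELECT hash_history FROM descriptions WHERE hash = ?", ("sha256:v2",)
    ).fetchone()[0]
)
```

Using the connection as a context manager keeps each write in its own transaction, which matches the "safe database connection handling" goal listed under the design principles.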
.lod Files (lod_file/)
Structured comment files stored alongside source code:
# @lod hash:sha256:a1b2c3d4... stale:false
# @lod description: Provides user authentication functionality.
def authenticate_user(username: str, password: str) -> str:
    ...
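Reading these annotations back can be sketched with two regular expressions. This is a hypothetical parser, not the code in `lod_file/comment_parser.py`; it assumes the exact two-line layout shown above (a `hash`/`stale` line followed by a single-line `description`) and `#` as the comment marker.

```python
import re

# Assumed line formats, taken from the example above.
HEADER_RE = re.compile(
    r"^#\s*@lod\s+hash:(?P<hash>\S+)\s+stale:(?P<stale>true|false)\s*$"
)
DESCRIPTION_RE = re.compile(r"^#\s*@lod\s+description:\s*(?P<text>.*)$")


def parse_lod_comments(text: str) -> list[dict]:
    """Collect one entry per '@lod hash:' line, attaching its description."""
    entries = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        header = HEADER_RE.match(line)
        if header:
            entries.append(
                {
                    "hash": header["hash"],
                    "stale": header["stale"] == "true",
                    "description": "",
                }
            )
            continue
        description = DESCRIPTION_RE.match(line)
        if description and entries:
            entries[-1]["description"] = description["text"].strip()
    return entries


sample = (
    "# @lod hash:sha256:a1b2c3d4... stale:false\n"
    "# @lod description: Provides user authentication functionality.\n"
    "def authenticate_user(username: str, password: str) -> str:\n"
    "    ...\n"
)
entries = parse_lod_comments(sample)
```

Multi-line descriptions and other comment syntaxes are out of scope for this sketch.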
Why dual storage?
- SQLite: Fast lookups, caching, revert detection
- .lod files: Human-readable, version-controlled, LLM-consumable
LLM Integration (llm/description_generator/)
Abstract interface for description generation with multiple provider implementations:
from enum import Enum
class Provider(str, Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    OLLAMA = "ollama"
    MOCK = "mock"
def get_generator(
    provider: Provider | None = None,
    model: str | None = None,
) -> DescriptionGenerator:
    """Get a description generator instance."""
Provider Implementations
- OpenAI (openai.py): GPT-4, GPT-4o, GPT-3.5-turbo via the openai package
- Anthropic (anthropic.py): Claude Sonnet, Claude Haiku, Claude Opus via the anthropic package
- Ollama (ollama.py): Local models (codellama, mistral, llama2, etc.) via the ollama package
- Mock (mock.py): Placeholder descriptions for testing (no API key required)
Auto-Detection
The get_generator() function auto-detects providers from environment variables:
1. Checks ANTHROPIC_API_KEY → uses Anthropic
2. Checks OPENAI_API_KEY → uses OpenAI
3. Falls back to Mock generator
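The detection order above can be sketched as a small function. `detect_provider` is a hypothetical name, and treating an empty variable as unset is an assumption; the real logic lives inside `get_generator()`.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from enum import Enum


class Provider(str, Enum):
    OPENAI = "openai"
    ANTHROPIC = "anthropic"
    OLLAMA = "ollama"
    MOCK = "mock"


def detect_provider(env: Mapping[str, str] | None = None) -> Provider:
    """Pick a provider from environment variables, in the documented order."""
    env = os.environ if env is None else env
    if env.get("ANTHROPIC_API_KEY"):  # checked first
        return Provider.ANTHROPIC
    if env.get("OPENAI_API_KEY"):  # checked second
        return Provider.OPENAI
    return Provider.MOCK  # no key found: fall back to the mock generator


both_keys = detect_provider({"ANTHROPIC_API_KEY": "a", "OPENAI_API_KEY": "b"})
openai_only = detect_provider({"OPENAI_API_KEY": "b"})
no_keys = detect_provider({})
```

Passing the environment as a mapping keeps the function easy to test; with both keys set, Anthropic wins because it is checked first.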
Scope-Specific Models
Configure different models for different hierarchical scopes:
code-lod config set-model --scope function --provider openai --model gpt-4o
code-lod config set-model --scope project --provider anthropic --model claude-sonnet
Base Generator Interface
from abc import ABC, abstractmethod
class DescriptionGenerator(ABC):
    @abstractmethod
    def generate(self, entity: ParsedEntity, context: str | None = None) -> str:
        """Generate a description for a code entity."""
    @abstractmethod
    def generate_batch(self, entities: list[ParsedEntity], context: str | None = None) -> list[str]:
        """Generate descriptions for multiple entities."""
The BaseLLMDescriptionGenerator provides:
- Prompt templates for function, class, and module scopes
- Source truncation for large code blocks
- Automatic fallback to mock on API errors
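Two of the behaviours listed above, source truncation and fallback to the mock generator, can be sketched as follows. Everything here is illustrative: `truncate_source`, `MAX_SOURCE_CHARS`, `FailingGenerator`, and `FallbackGenerator` are hypothetical names, `ParsedEntity` is a reduced stand-in, and the real `BaseLLMDescriptionGenerator` may structure this differently.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from dataclasses import dataclass

MAX_SOURCE_CHARS = 200  # hypothetical limit, not the project's actual value


@dataclass(frozen=True)
class ParsedEntity:
    """Reduced stand-in with only the fields this sketch needs."""
    name: str
    source: str


def truncate_source(source: str, limit: int = MAX_SOURCE_CHARS) -> str:
    """Cut overly long source before it is placed in a prompt."""
    if len(source) <= limit:
        return source
    return source[:limit] + "\n# ... truncated ..."


class DescriptionGenerator(ABC):
    @abstractmethod
    def generate(self, entity: ParsedEntity, context: str | None = None) -> str: ...

    @abstractmethod
    def generate_batch(
        self, entities: list[ParsedEntity], context: str | None = None
    ) -> list[str]: ...


class MockGenerator(DescriptionGenerator):
    def generate(self, entity, context=None):
        return f"[mock] Description of {entity.name}"

    def generate_batch(self, entities, context=None):
        return [self.generate(entity, context) for entity in entities]


class FailingGenerator(MockGenerator):
    """Simulates a provider whose API call fails."""

    def generate(self, entity, context=None):
        raise ConnectionError("simulated API outage")


class FallbackGenerator(DescriptionGenerator):
    """Tries the primary generator and falls back to another on any error."""

    def __init__(self, primary: DescriptionGenerator, fallback: DescriptionGenerator):
        self.primary = primary
        self.fallback = fallback

    def generate(self, entity, context=None):
        try:
            return self.primary.generate(entity, context)
        except Exception:  # a real implementation would catch provider errors
            return self.fallback.generate(entity, context)

    def generate_batch(self, entities, context=None):
        return [self.generate(entity, context) for entity in entities]


generator = FallbackGenerator(FailingGenerator(), MockGenerator())
description = generator.generate(ParsedEntity("login", "def login(): ..."))
shortened = truncate_source("x" * 500, limit=10)
```

Wrapping the fallback in its own generator keeps the error handling out of each provider implementation; whether the project does this by composition or inside the base class is not stated here.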
Data Models
Scope Hierarchy
class Scope(str, Enum):
    PROJECT = "project"    # Entire codebase
    PACKAGE = "package"    # Directory/module group
    MODULE = "module"      # Single file
    CLASS = "class"        # Class definition
    FUNCTION = "function"  # Function/method
ParsedEntity
@dataclass
class ParsedEntity:
    scope: Scope
    name: str
    location: CodeLocation
    source: str
    ast_hash: str
    language: str
    parent_name: str | None = None
File Structure
src/code_lod/
├── __init__.py
├── __main__.py              # Entry point
├── cli/                     # CLI commands (one per file)
│   ├── __init__.py          # Main app, command registration
│   ├── clean.py             # Clean command
│   ├── config.py            # Config and set-model commands
│   ├── generate.py          # Generate command
│   ├── hooks.py             # install-hook, uninstall-hook
│   ├── init.py              # Init command
│   ├── read.py              # Read command
│   ├── status.py            # Status command
│   ├── update.py            # Update command
│   └── validate.py          # Validate command
├── config.py                # Configuration and paths management
├── db.py                    # SQLite database layer
├── hashing.py               # AST hash computation
├── models.py                # Pydantic data models
├── staleness.py             # Staleness tracking
├── llm/                     # LLM integration
│   ├── __init__.py
│   └── description_generator/
│       ├── generator.py     # Base classes, Provider enum, get_generator()
│       ├── anthropic.py     # Anthropic Claude provider
│       ├── openai.py        # OpenAI provider
│       ├── ollama.py        # Ollama local models provider
│       └── mock.py          # Mock generator for testing
├── parsers/                 # Code parsers
│   ├── __init__.py
│   ├── base.py              # BaseParser interface
│   └── tree_sitter_parser.py
└── lod_file/                # .lod file management
    ├── __init__.py
    ├── comment_parser.py    # Parse @lod comments
    ├── reader.py            # Read .lod files
    └── writer.py            # Write .lod files
Directory Layout
After running code-lod init:
your-project/
├── .code-lod/
│   ├── config.json          # Project configuration
│   ├── hash-index.db        # SQLite database (not version controlled)
│   └── .lod/                # Description files (version controlled)
│       └── src/
│           └── module.py.lod
└── src/
    └── module.py
Design Principles
- Plugin Architecture: Parsers and generators use abstract base classes for extensibility
- Hash-Based Change Detection: Semantic changes trigger updates, formatting doesn't
- Dual Storage: SQLite for performance, .lod files for portability
- Frozen Dataclasses: Immutable data where possible (CodeLocation)
- Context Managers: Safe database connection handling
- Structured Logging: All operations logged via structlog
Performance Considerations
- Hash lookups: O(log n) via the SQLite primary-key index (a B-tree), effectively constant-time at typical project sizes
- Tree-sitter parsing: Fast, incremental parsing
- Lazy generation: Only regenerates stale descriptions
- Caching: Database serves as description cache