Project Vectorizer

A powerful CLI tool that vectorizes codebases, stores them in a vector database, tracks changes, and serves them via MCP (Model Context Protocol) for AI agents like Claude, Codex, and others.

Latest Version: 0.1.4 | Changelog | GitHub


📋 Table of Contents

  • Features
  • Installation
  • Quick Start
  • Performance Optimization
  • CLI Commands
  • Configuration
  • Search Features
  • MCP Server
  • Advanced Usage
  • Troubleshooting
  • Changelog
  • Contributing
  • License

Features

🚀 Performance & Optimization

  • Auto-Optimized Config - Auto-detect CPU cores and RAM for optimal settings (--optimize)
  • Max Resources Mode - Use maximum system resources for fastest indexing (--max-resources)
  • Smart Incremental - 60-70% faster indexing with intelligent change categorization
  • Git-Aware Indexing - 80-90% faster by indexing only git-changed files
  • Parallel Processing - Multi-threaded with auto-detected optimal worker count (up to 16 workers)
  • Memory Monitoring - Real-time memory tracking with automatic garbage collection
  • Batch Optimization - Memory-based batch size calculation for safe processing

🔍 Search & Indexing

  • Code Vectorization - Parse and vectorize with sentence-transformers or OpenAI embeddings
  • Multi-Level Chunking - Functions, classes, micro-chunks, and word-level chunks for precision
  • Enhanced Single-Word Search - High-precision search for single keywords (0.8+ thresholds)
  • Semantic + Exact Search - Combines semantic similarity with exact word matching
  • Adaptive Thresholds - Automatically adjusts for optimal results
  • Multiple Languages - 30+ languages (Python, JS, TS, Go, Rust, Java, C++, C, PHP, Ruby, Swift, Kotlin, and more)

🔄 Change Management

  • Git Integration - Track changes via git commits with index-git command
  • Smart File Categorization - Detects New, Modified, and Deleted files
  • Watch Mode - Real-time monitoring with configurable debouncing (0.5-10s)
  • Incremental Updates - Only re-index changed content
  • Hash-Based Detection - SHA256 file hashing for accurate change detection

🌐 AI Integration

  • MCP Server - Model Context Protocol for AI agents (Claude, Codex, etc.)
  • HTTP Fallback API - RESTful endpoints when MCP unavailable
  • Semantic Search - Natural language queries for code discovery
  • File Operations - Get content, list files, project statistics

🎨 User Experience

  • Clean Progress Output - Single unified progress bar with timing information
  • Suppressed Library Logs - No cluttered batch progress bars from dependencies
  • Timing Information - Elapsed time for all operations (seconds or minutes+seconds)
  • Verbose Mode - Optional detailed logging for debugging
  • Professional UI - Rich terminal output with colors, panels, and formatting
  • Real-time Updates - Live file names and status tags during indexing

💾 Database & Storage

  • ChromaDB Backend - High-performance vector database
  • Fast HNSW Indexing - Optimized similarity search algorithm
  • Scalable - Handles 500K+ chunks efficiently
  • Single Database - No external dependencies required
  • Custom Paths - Configurable database location

Installation

From PyPI (Recommended)

# Install from PyPI
pip install project-vectorizer

# Verify installation
pv --version

From Source

# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer

# Install
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

Quick Start

1. Initialize Your Project

# 🚀 Recommended: Auto-optimize based on your system (16 workers, 400 batch on 8-core/16GB RAM)
pv init /path/to/project --optimize

# Or with custom settings
pv init /path/to/project \
  --name "My Project" \
  --embedding-model "all-MiniLM-L6-v2" \
  --chunk-size 256 \
  --optimize

Output:

✓ Project initialized successfully!

Name: My Project
Path: /path/to/project
Model: all-MiniLM-L6-v2
Provider: sentence-transformers
Chunk Size: 256 tokens

Optimized Settings:
  • Workers: 16
  • Batch Size: 400
  • Embedding Batch: 200
  • Memory Monitoring: Enabled
  • GC Interval: 100 files

2. Index Your Codebase

# 🚀 Recommended: First-time indexing with max resources (2-4x faster)
pv index /path/to/project --max-resources

# 🚀 Recommended: Smart incremental for updates (60-70% faster)
pv index /path/to/project --smart

# 🚀 Recommended: Git-aware for recent changes (80-90% faster)
pv index-git /path/to/project --since HEAD~5

# Standard full indexing
pv index /path/to/project

# Force re-index everything
pv index /path/to/project --force

# Combine for maximum performance
pv index /path/to/project --smart --max-resources

Output:

Using maximum system resources (optimized settings)...
  • Workers: 16
  • Batch Size: 400
  • Embedding Batch: 200

  Indexing examples/demo.py ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 100%

╭────────────────── Indexing Complete ──────────────────╮
│ ✓ Indexing complete!                                  │
│                                                       │
│ Files indexed: 48/49                                  │
│ Total chunks: 9222                                    │
│ Model: all-MiniLM-L6-v2                               │
│ Time taken: 2m 16s                                    │
│                                                       │
│ You can now search with: pv search . "your query"     │
╰───────────────────────────────────────────────────────╯

3. Search Your Code

# Natural language search
pv search /path/to/project "authentication logic"

# Single-word searches work great (high precision)
pv search /path/to/project "async" --threshold 0.8
pv search /path/to/project "test" --threshold 0.9

# Multi-word queries (semantic search)
pv search /path/to/project "user login validation" --threshold 0.5

# Find specific constructs
pv search /path/to/project "class" --limit 10

Output:

Search Results for: authentication logic

Found 5 result(s) with threshold >= 0.5

╭─────────────────────── Result 1 ───────────────────────╮
│ src/auth/login.py                                      │
│ Lines 45-67 | Similarity: 0.892                        │
│                                                        │
│ def authenticate_user(username: str, password: str):   │
│     """                                                │
│     Authenticate user credentials against database     │
│     Returns user object if valid, None otherwise       │
│     """                                                │
│     ...                                                │
╰────────────────────────────────────────────────────────╯

4. Start MCP Server

# Start server (default: localhost:8000)
pv serve /path/to/project

# Custom host/port
pv serve /path/to/project --host 0.0.0.0 --port 8080

5. Monitor Changes in Real-Time

# Watch for file changes (default 2s debounce)
pv sync /path/to/project --watch

# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5

# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0

Performance Optimization

Understanding the Optimization Flags

--optimize (Permanent)

Use when initializing a new project. Detects your system and saves optimal settings.

pv init /path/to/project --optimize

What it does:

  • Detects CPU cores → sets max_workers (e.g., 8 cores = 16 workers)
  • Calculates RAM → sets safe batch_size (e.g., 16GB = 400 batch)
  • Sets memory thresholds based on total RAM
  • Saves to config - All future operations use these settings

When to use:

  • ✅ New projects
  • ✅ Want permanent optimization
  • ✅ Same machine for all operations
  • ✅ "Set and forget" approach

--max-resources (Temporary)

Use when indexing to temporarily boost performance without changing config.

pv index /path/to/project --max-resources
pv index-git /path/to/project --since HEAD~1 --max-resources

What it does:

  • Detects system resources (same as --optimize)
  • Temporarily overrides config for this operation only
  • Original config unchanged

When to use:

  • ✅ Existing project without optimization
  • ✅ One-time heavy indexing
  • ✅ CI/CD with dedicated resources
  • ✅ Don't want to modify config

Performance Benchmarks

System: 8-core CPU, 16GB RAM, SSD

Mode                 Files       Chunks   Time     Settings
Standard             48          9222     4m 32s   4 workers, 100 batch
--max-resources      48          9222     2m 16s   16 workers, 400 batch
Smart incremental    5 changed   412      24s      16 workers, 400 batch
Git-aware (HEAD~1)   3 changed   287      15s      16 workers, 400 batch

Key Findings:

  • --max-resources: 2x faster for full indexing
  • Smart incremental: 60-70% faster than full reindex
  • Git-aware: 80-90% faster for recent changes
  • Chunk size (128 vs 512): No performance difference (same ~2m 16s)

System Resource Detection

CPU Detection:

Detected: 8 cores
Optimal workers: min(8 * 2, 16) = 16 workers

Memory Detection:

Total RAM: 16GB
Available RAM: 8GB
Safe batch size: 8GB * 0.5 * 100 = 400
Embedding batch: 400 * 0.5 = 200
GC interval: 100 files

Memory Thresholds:

32GB+ RAM → threshold: 50000
16-32GB   → threshold: 20000
8-16GB    → threshold: 10000
<8GB      → threshold: 5000
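
As a concrete illustration, the sketch below reproduces these formulas with psutil. The function name and returned dict are hypothetical, not the tool's internals; --optimize performs equivalent detection and saves the results to config.

import psutil

def detect_optimal_settings():
    """Mirror the documented --optimize formulas (illustration only)."""
    cores = psutil.cpu_count(logical=False) or 1
    mem = psutil.virtual_memory()
    total_gb = mem.total / 1024**3
    available_gb = mem.available / 1024**3

    workers = min(cores * 2, 16)                # 8 cores -> 16 workers
    batch_size = int(available_gb * 0.5 * 100)  # 8GB available -> 400
    embedding_batch = batch_size // 2           # 400 -> 200

    if total_gb >= 32:
        threshold = 50000
    elif total_gb >= 16:
        threshold = 20000
    elif total_gb >= 8:
        threshold = 10000
    else:
        threshold = 5000

    return {
        "max_workers": workers,
        "batch_size": batch_size,
        "embedding_batch_size": embedding_batch,
        "memory_efficient_search_threshold": threshold,
        "gc_interval": 100,
    }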

Best Practices

  1. Initialize with optimization

    pv init ~/my-project --optimize
  2. Use max resources for heavy operations

    pv index ~/my-project --force --max-resources
  3. Use smart mode for daily updates

    pv index ~/my-project --smart
  4. Use git-aware after pulling changes

    pv index-git ~/my-project --since HEAD~1
  5. Monitor memory with verbose mode

    pv index ~/my-project --max-resources --verbose

CLI Commands

Global Options

pv [OPTIONS] COMMAND [ARGS]

Options:
  -v, --verbose    Enable verbose output
  --version        Show version
  --help           Show help

pv init - Initialize Project

Initialize a new project for vectorization.

pv init [OPTIONS] PROJECT_PATH

Options:
  -n, --name TEXT              Project name (default: directory name)
  -m, --embedding-model TEXT   Model name (default: all-MiniLM-L6-v2)
  -p, --embedding-provider     Provider: sentence-transformers | openai
  -c, --chunk-size INT         Chunk size in tokens (default: 256)
  -o, --chunk-overlap INT      Overlap in tokens (default: 32)
  --optimize                   Auto-optimize based on system resources ⭐

Examples:

# Basic initialization
pv init /path/to/project

# With optimization (recommended)
pv init /path/to/project --optimize

# With OpenAI embeddings
export OPENAI_API_KEY="sk-..."
pv init /path/to/project \
  --embedding-provider openai \
  --embedding-model text-embedding-ada-002 \
  --optimize

pv index - Index Codebase

Index the codebase for searching.

pv index [OPTIONS] PROJECT_PATH

Options:
  -i, --incremental      Only index changed files
  -s, --smart            Smart incremental (categorized: new/modified/deleted) ⭐
  -f, --force            Force re-index all files
  --max-resources        Use maximum system resources ⭐

Examples:

# Full indexing with max resources
pv index /path/to/project --max-resources

# Smart incremental (fastest for updates)
pv index /path/to/project --smart

# Combine for maximum performance
pv index /path/to/project --smart --max-resources

# Force complete reindex
pv index /path/to/project --force

pv index-git - Git-Aware Indexing

Index only files changed in git commits.

pv index-git [OPTIONS] PROJECT_PATH

Options:
  -s, --since TEXT       Git reference (default: HEAD~1)
  --max-resources        Use maximum system resources ⭐

Examples:

# Last commit
pv index-git /path/to/project --since HEAD~1

# Last 5 commits
pv index-git /path/to/project --since HEAD~5

# Since main branch
pv index-git /path/to/project --since main

# Since specific commit
pv index-git /path/to/project --since abc123def

# With max resources
pv index-git /path/to/project --since HEAD~10 --max-resources

Use Cases:

  • After git pull - index only new changes
  • Before code review - index PR changes
  • CI/CD pipelines - index commit range
  • After branch switch - index differences

pv search - Search Code

Search through vectorized codebase.

pv search [OPTIONS] PROJECT_PATH QUERY

Options:
  -l, --limit INT        Number of results (default: 10)
  -t, --threshold FLOAT  Similarity threshold 0.0-1.0 (default: 0.3)

Examples:

# Natural language search
pv search /path/to/project "error handling in database connections"

# Single-word search (high threshold)
pv search /path/to/project "async" --threshold 0.9

# Find all tests
pv search /path/to/project "test" --limit 20 --threshold 0.8

# Broad semantic search (low threshold)
pv search /path/to/project "api authentication" --threshold 0.3

Threshold Guide:

  • 0.8-0.95: Single words, exact matches
  • 0.5-0.7: Multi-word phrases, semantic
  • 0.3-0.5: Complex queries, broad search
  • 0.1-0.3: Very broad, exploratory

pv sync - Sync Changes / Watch Mode

Sync changes or watch for file modifications.

pv sync [OPTIONS] PROJECT_PATH

Options:
  -w, --watch           Watch for file changes
  -d, --debounce FLOAT  Debounce delay in seconds (default: 2.0)

Examples:

# One-time sync (smart incremental)
pv sync /path/to/project

# Watch mode with default debounce (2s)
pv sync /path/to/project --watch

# Fast feedback (0.5s)
pv sync /path/to/project --watch --debounce 0.5

# Slower systems (5s)
pv sync /path/to/project --watch --debounce 5.0

Debounce Explained:

  • Waits X seconds after last file change before indexing
  • Batches multiple rapid changes together
  • Prevents redundant indexing when saving files repeatedly
  • Reduces CPU usage during active development
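
A minimal sketch of the mechanism, independent of the tool's actual implementation: each file event restarts a countdown, and reindexing fires only after events stop arriving for the debounce interval.

import threading

class Debouncer:
    """Run `callback` once, `delay` seconds after the last event."""

    def __init__(self, callback, delay=2.0):
        self.callback = callback
        self.delay = delay
        self._timer = None

    def on_event(self):
        if self._timer is not None:
            self._timer.cancel()  # restart the countdown
        self._timer = threading.Timer(self.delay, self.callback)
        self._timer.start()

# Three rapid saves trigger a single reindex after 2s of quiet.
debouncer = Debouncer(lambda: print("reindexing..."), delay=2.0)
for _ in range(3):
    debouncer.on_event()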

Recommended Values:

  • 0.5-1.0s: Fast machines, need instant feedback
  • 2.0s: Balanced (default)
  • 5.0-10.0s: Slower machines, large codebases

pv serve - Start MCP Server

Start MCP server for AI agent integration.

pv serve [OPTIONS] PROJECT_PATH

Options:
  -p, --port INT   Port number (default: 8000)
  -h, --host TEXT  Host address (default: localhost)

Examples:

# Start server
pv serve /path/to/project

# Custom port
pv serve /path/to/project --port 8080

# Expose to network
pv serve /path/to/project --host 0.0.0.0 --port 8000

pv status - Show Project Status

Show project status and statistics.

pv status PROJECT_PATH

Output:

╭────────────── Project Status ──────────────╮
│ Name              my-project               │
│ Path              /path/to/project         │
│ Embedding Model   all-MiniLM-L6-v2         │
│                                            │
│ Total Files       49                       │
│ Indexed Files     48                       │
│ Total Chunks      9222                     │
│                                            │
│ Git Branch        main                     │
│ Last Updated      2025-10-13 12:15:42      │
│ Created           2025-10-10 09:30:15      │
╰────────────────────────────────────────────╯

Configuration

Config File Location

Configuration is stored at <project>/.vectorizer/config.json

Full Configuration Reference

{
  "chromadb_path": null,
  "embedding_model": "all-MiniLM-L6-v2",
  "embedding_provider": "sentence-transformers",
  "openai_api_key": null,
  "chunk_size": 128,
  "chunk_overlap": 32,
  "max_file_size_mb": 10,
  "included_extensions": [
    ".py",
    ".js",
    ".ts",
    ".jsx",
    ".tsx",
    ".go",
    ".rs",
    ".java",
    ".cpp",
    ".c",
    ".h",
    ".hpp",
    ".cs",
    ".php",
    ".rb",
    ".swift",
    ".kt",
    ".scala",
    ".clj",
    ".sh",
    ".bash",
    ".zsh",
    ".fish",
    ".ps1",
    ".bat",
    ".cmd",
    ".md",
    ".txt",
    ".rst",
    ".json",
    ".yaml",
    ".yml",
    ".toml",
    ".xml",
    ".html",
    ".css",
    ".scss",
    ".sql",
    ".graphql",
    ".proto"
  ],
  "excluded_patterns": [
    "node_modules/**",
    ".git/**",
    "__pycache__/**",
    "*.pyc",
    ".pytest_cache/**",
    "venv/**",
    "env/**",
    ".env/**",
    "build/**",
    "dist/**",
    "*.egg-info/**",
    ".DS_Store",
    "*.min.js",
    "*.min.css"
  ],
  "mcp_host": "localhost",
  "mcp_port": 8000,
  "log_level": "INFO",
  "log_file": null,
  "max_workers": 4,
  "batch_size": 100,
  "embedding_batch_size": 100,
  "parallel_file_processing": true,
  "memory_monitoring_enabled": true,
  "memory_efficient_search_threshold": 10000,
  "gc_interval": 100
}

Key Settings Explained

Embedding Settings:

  • embedding_model: Model for embeddings (all-MiniLM-L6-v2, text-embedding-ada-002, etc.)
  • embedding_provider: "sentence-transformers" (local) or "openai" (API)
  • chunk_size: Tokens per chunk (128 for precision, 512 for context)
  • chunk_overlap: Overlap between chunks (16-32 recommended)

Performance Settings:

  • max_workers: Parallel workers (auto-detected with --optimize)
  • batch_size: Files per batch (auto-calculated with --optimize)
  • embedding_batch_size: Embeddings per batch
  • parallel_file_processing: Enable parallel processing (recommended: true)

Memory Settings:

  • memory_monitoring_enabled: Monitor RAM usage (recommended: true)
  • memory_efficient_search_threshold: Switch to streaming for large results
  • gc_interval: Garbage collection frequency (files between GC)

File Filtering:

  • included_extensions: File types to index
  • excluded_patterns: Glob patterns to ignore
  • max_file_size_mb: Skip files larger than this
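
A rough sketch of how such glob patterns can be checked (the tool's actual matcher may differ): fnmatch-style matching works here because its * also crosses path separators.

import fnmatch

EXCLUDED = ["node_modules/**", "__pycache__/**", "*.min.js"]

def is_excluded(path: str) -> bool:
    # fnmatch's "*" matches "/" too, so "dir/**" covers nested paths
    return any(fnmatch.fnmatch(path, pattern) for pattern in EXCLUDED)

print(is_excluded("node_modules/react/index.js"))  # True
print(is_excluded("src/app.py"))                   # False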

Server Settings:

  • mcp_host: MCP server host
  • mcp_port: MCP server port
  • log_level: INFO, DEBUG, WARNING, ERROR
  • chromadb_path: Custom ChromaDB location (optional)

Environment Variables

Create .env file or export:

# OpenAI API Key (required for OpenAI embeddings)
export OPENAI_API_KEY="sk-..."

# Override config values
export EMBEDDING_PROVIDER="sentence-transformers"
export EMBEDDING_MODEL="all-MiniLM-L6-v2"
export CHUNK_SIZE="256"
export DEFAULT_SEARCH_THRESHOLD="0.3"

# Database
export CHROMADB_PATH="/custom/path/to/chromadb"

# Logging
export LOG_LEVEL="INFO"
export LOG_FILE="/var/log/vectorizer.log"

For the complete list, see docs/ENVIRONMENT.md

Editing Configuration

# View current config
cat /path/to/project/.vectorizer/config.json

# Edit manually
nano /path/to/project/.vectorizer/config.json

# Or regenerate with optimization
pv init /path/to/project --optimize

Search Features

Single-Word Search

Optimized for high-precision single-keyword searches.

# Programming keywords
pv search /path/to/project "async" --threshold 0.9
pv search /path/to/project "test" --threshold 0.8
pv search /path/to/project "class" --threshold 0.9
pv search /path/to/project "import" --threshold 0.85

# Works great for finding specific constructs
pv search /path/to/project "def" --threshold 0.9  # Python functions
pv search /path/to/project "function" --threshold 0.9  # JS functions
pv search /path/to/project "catch" --threshold 0.8  # Error handling

Features:

  • Exact Word Matching: Prioritizes exact word boundaries
  • Keyword Detection: Special handling for programming keywords
  • Relevance Boosting: Huge boost for exact matches
  • High Thresholds: Reliable results even at 0.8-0.9+

Multi-Word Search

Semantic search for phrases and concepts.

# Natural language
pv search /path/to/project "user authentication logic" --threshold 0.5

# Code patterns
pv search /path/to/project "error handling in database" --threshold 0.4

# Features
pv search /path/to/project "rate limiting middleware" --threshold 0.6

Search Result Ranking

Results ranked by:

  1. Exact word matches (highest priority)
  2. Content type (micro/word chunks get boost)
  3. Partial matches within larger words
  4. Semantic similarity from embeddings
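
A hypothetical scoring sketch that illustrates this ordering; the actual boost values inside the tool are not documented and will differ.

import re

def rank_score(query: str, chunk_text: str, chunk_type: str, similarity: float) -> float:
    score = similarity                          # 4. base semantic similarity
    text = chunk_text.lower()
    q = query.lower()

    if re.search(rf"\b{re.escape(q)}\b", text):
        score += 1.0                            # 1. exact word match (highest priority)
    elif q in text:
        score += 0.25                           # 3. partial match within a larger word
    if chunk_type in ("micro", "word"):
        score += 0.1                            # 2. micro/word chunks get a boost

    return score

# "async" as an exact word outranks "asyncio" as a partial match
print(rank_score("async", "async def run():", "micro", 0.6))
print(rank_score("async", "import asyncio", "function", 0.6))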

Recommended Thresholds by Query Type

Query Type        Threshold   Example
Single keyword    0.7-0.95    "async", "test", "class"
Two words         0.5-0.8     "error handling", "api routes"
Short phrase      0.4-0.7     "user login validation"
Complex query     0.3-0.5     "authentication with jwt tokens"
Exploratory       0.1-0.3     "machine learning model training"

MCP Server

Starting the Server

# Default (localhost:8000)
pv serve /path/to/project

# Custom settings
pv serve /path/to/project --host 0.0.0.0 --port 8080

Available MCP Tools

While the server is running, AI agents can use these tools:

  1. search_code - Search vectorized codebase

    {
      "query": "authentication logic",
      "limit": 10,
      "threshold": 0.5
    }
  2. get_file_content - Retrieve full file

    {
      "file_path": "src/auth/login.py"
    }
  3. list_files - List all files

    {
      "file_type": "py" // optional filter
    }
  4. get_project_stats - Get statistics

    {}

HTTP Fallback API

If MCP is unavailable, the same functionality is exposed via HTTP endpoints:

# Search
curl "http://localhost:8000/search?q=authentication&limit=5&threshold=0.5"

# Get file
curl "http://localhost:8000/file/src/auth/login.py"

# List files
curl "http://localhost:8000/files?type=py"

# Statistics
curl "http://localhost:8000/stats"

# Health check
curl "http://localhost:8000/health"

Use Cases

  1. AI Code Review: Let Claude analyze your codebase semantically
  2. Intelligent Navigation: Ask AI to find relevant code
  3. Documentation: Generate docs from actual code
  4. Onboarding: Help new devs understand codebase
  5. Refactoring: Find similar patterns across project

Advanced Usage

Python API

Basic Usage

import asyncio
from pathlib import Path
from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def main():
    # Initialize project
    config = Config.create_optimized(
        embedding_model="all-MiniLM-L6-v2",
        chunk_size=256
    )

    project_path = Path("/path/to/project")
    manager = ProjectManager(project_path, config)

    # Initialize
    await manager.initialize("My Project")

    # Index
    await manager.load()
    await manager.index_all()

    # Search
    results = await manager.search("authentication", limit=10, threshold=0.5)
    for result in results:
        print(f"{result['file_path']}: {result['similarity']:.3f}")

asyncio.run(main())

Progress Tracking

from rich.progress import Progress

from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def index_with_progress(project_path):
    config = Config.load_from_project(project_path)
    manager = ProjectManager(project_path, config)
    await manager.load()

    with Progress() as progress:
        task = progress.add_task("Indexing...", total=100)

        def update_progress(current, total, description):
            progress.update(task, completed=current, total=total, description=description)

        manager.set_progress_callback(update_progress)
        await manager.index_all()

Custom Resource Limits

import psutil

from project_vectorizer.core.config import Config
from project_vectorizer.core.project import ProjectManager

async def adaptive_index(project_path):
    """Index with resources based on current load."""
    cpu_percent = psutil.cpu_percent(interval=1)

    if cpu_percent < 50:  # System idle
        config = Config.create_optimized()
    else:  # System busy
        config = Config(max_workers=4, batch_size=100)

    manager = ProjectManager(project_path, config)
    await manager.load()
    await manager.index_all()

Chunk Size Optimization

The engine enforces a maximum of 128 tokens per chunk (see engine.py:35) for precision, although larger sizes can still be configured:

# Precision (default, forced max 128)
pv init /path/to/project --chunk-size 128

# More context (still capped at 128 by engine)
pv init /path/to/project --chunk-size 512

Performance Note: Chunk size has virtually NO impact on indexing speed (~2m 16s for both 128 and 512 tokens). Choose based on search quality needs:

  • 128: Better precision, exact matches
  • 512: More context, better understanding
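
To make the chunk_size/chunk_overlap tradeoff concrete, here is a simplified sketch of fixed-size chunking with overlap. The tool's real chunker is multi-level and token-aware, so treat this as illustration only.

def chunk_with_overlap(tokens, chunk_size=128, overlap=32):
    """Yield windows of chunk_size tokens, each sharing `overlap`
    tokens with the previous window so context is not cut at
    chunk boundaries."""
    step = chunk_size - overlap
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if chunk:
            yield chunk
        if start + chunk_size >= len(tokens):
            break

tokens = "def authenticate_user ( username password ) : ...".split()
for chunk in chunk_with_overlap(tokens, chunk_size=4, overlap=1):
    print(chunk)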

CI/CD Integration

# .github/workflows/vectorize.yml
name: Vectorize Codebase

on:
  push:
    branches: [main]

jobs:
  vectorize:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Setup Python
        uses: actions/setup-python@v4
        with:
          python-version: "3.9"

      - name: Install vectorizer
        run: pip install project-vectorizer

      - name: Initialize and index
        run: |
          pv init . --optimize --name "${{ github.repository }}"
          pv index . --max-resources

      - name: Test search
        run: pv search . "test" --limit 5

Custom File Filters

{
  "included_extensions": [".py", ".js", ".custom"],
  "excluded_patterns": ["tests/**", "*.generated.js", "vendor/**", "*.min.*"]
}

Watch Mode During Development

# Terminal 1: Watch mode
pv sync /path/to/project --watch --debounce 1.0

# Terminal 2: Make code changes
# Auto-indexes when you save

# Terminal 3: Search as you code
pv search /path/to/project "your new function" --threshold 0.5

Troubleshooting

Common Issues

1. Slow Indexing

Problem: Indexing taking too long

Solutions:

# Use max resources
pv index /path/to/project --max-resources

# Use smart incremental for updates
pv index /path/to/project --smart

# Use git-aware for recent changes
pv index-git /path/to/project --since HEAD~1

# Check if optimization is working
pv index /path/to/project --max-resources --verbose
# Look for: "Workers: 16, Batch Size: 400"

2. High Memory Usage

Problem: Process using too much RAM or getting killed

Solutions:

# Reduce batch size in config
{
  "batch_size": 50,
  "max_workers": 4
}

# Enable memory monitoring
{
  "memory_monitoring_enabled": true,
  "gc_interval": 50
}

# Use smaller chunks
pv init /path/to/project --chunk-size 128

3. Poor Search Results

Problem: Search not finding relevant code

Solutions:

# Lower threshold for phrases
pv search /path/to/project "your query" --threshold 0.3

# Higher threshold for keywords
pv search /path/to/project "async" --threshold 0.9

# Use smaller chunk size for precision
# Edit config: "chunk_size": 128

# Ensure index is up to date
pv index /path/to/project --smart

4. No Results for Single Words

Problem: Single-word searches return nothing

Solutions:

# Try lower threshold
pv search /path/to/project "yourword" --threshold 0.5

# Check if word exists
pv search /path/to/project "yourword" --threshold 0.1 --limit 1

# Reindex with smaller chunks
# Edit config: "chunk_size": 128
pv index /path/to/project --force

5. Missing Recent Changes

Problem: Just-edited code not showing in search

Solutions:

# Run smart incremental
pv index /path/to/project --smart

# Or git-aware
pv index-git /path/to/project --since HEAD~1

# Check status
pv status /path/to/project

6. psutil Not Found

Problem: Optimization not working

Solution:

# Install psutil
pip install psutil

# Verify
python -c "import psutil; print(f'CPUs: {psutil.cpu_count()}, RAM: {psutil.virtual_memory().available / 1024**3:.1f}GB')"

# Try again
pv init /path/to/project --optimize

Debug Mode

# Enable verbose logging
pv --verbose index /path/to/project

# Check project status
pv status /path/to/project

# View config
cat /path/to/project/.vectorizer/config.json

# Check ChromaDB
ls -lh /path/to/project/.vectorizer/chromadb/

Performance Debugging

# Time operations
time pv index /path/to/project
time pv index /path/to/project --max-resources

# Monitor resources during indexing
# Terminal 1:
pv index /path/to/project --max-resources

# Terminal 2:
htop  # or top
# Should see high CPU across all cores

# Check memory warnings
pv index /path/to/project --max-resources --verbose
# Look for memory warnings

Changelog

[0.1.4] - 2025-10-13

Fixed

  • Hardcoded value – Replaced hardcoded configuration with dynamic variable lookup

Notes

  • This is a minor bugfix release with no API or CLI changes.

[0.1.3] - 2025-10-13

Fixed

  • Hardcoded value – Replaced hardcoded configuration with dynamic variable lookup
    • Prevents unexpected behavior when running with custom configs

Notes

  • This is a minor bugfix release with no API or CLI changes.

[0.1.2] - 2025-10-13

Added

  • Optimized Config Generation - Config.create_optimized() auto-detects CPU/RAM
  • Max Resources Flag - --max-resources for temporary performance boost
  • psutil Integration - Automatic system resource detection
  • Unified Progress Tracking - Clean single-line progress bar
  • Library Progress Suppression - No more cluttered batch progress bars
  • Timing Information - All operations show elapsed time
  • Clean Terminal Output - Professional UI with timing

Performance

  • 2x faster full indexing with --max-resources
  • 60-70% faster smart incremental updates
  • 80-90% faster git-aware indexing

Documentation

  • Comprehensive documentation overhaul
  • Consolidated all guides into main README
  • Added CHANGELOG.md with version history

[0.1.1] - 2025-10-12

  • Enhanced single-word search with high precision
  • Multi-level chunking (micro + word-level)
  • Adaptive search thresholds
  • Programming keyword detection
  • Improved word matching and relevance boosting

[0.1.0] - 2025-10-10

  • Initial release
  • Code vectorization
  • Smart incremental indexing
  • Git-aware indexing
  • MCP server
  • Watch mode
  • ChromaDB backend
  • 30+ language support

Contributing

Development Setup

# Clone repository
git clone https://github.com/starkbaknet/project-vectorizer.git
cd project-vectorizer

# Create virtual environment
python -m venv venv
source venv/bin/activate  # Windows: venv\Scripts\activate

# Install with dev dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Format code
black .
isort .

Running Tests

# All tests
pytest

# With coverage
pytest --cov=project_vectorizer

# Specific test
pytest tests/test_config.py

# Verbose
pytest -v

See docs/TESTING.md for details.

Publishing

See docs/PUBLISHING.md for PyPI publishing guide.

Contributing Guidelines

  1. Fork repository
  2. Create feature branch: git checkout -b feature/amazing-feature
  3. Make changes and add tests
  4. Ensure tests pass: pytest
  5. Format code: black . && isort .
  6. Commit: git commit -m 'Add amazing feature'
  7. Push: git push origin feature/amazing-feature
  8. Open Pull Request

License

MIT License - see LICENSE file



Made with ❤️ by StarkBakNet

Vectorize your codebase. Empower your AI agents. Build better software.