
Implementation Plan: Workspace-Wide Vector Database with Kong AI Gateway

Author: Christopher Blaisdell
Date: 2026-03-14
Status: PROPOSED
Relates To: Vector DB / RAG Feasibility Analysis
Phase: Phase 2 - AI Workflow Enhancement
See Also: Context Window Utilization Analysis, Deep Research: Copilot vs Kong+Roo Economics

Executive Summary

This plan describes how to build a workspace-wide vector database that indexes the entire NovaTrek architecture workspace and exposes it to AI coding assistants (Roo Code, Copilot, Continue.dev) through the Model Context Protocol (MCP). Kong AI Gateway manages all LLM and embedding API traffic, providing cost control, observability, prompt decoration (a system prompt that steers agents toward the search tool), and multi-provider failover.

The key insight: Kong AI Gateway cannot be the vector database or retrieval engine, but it is the correct place to manage the AI API traffic that powers the system. The architecture places each component in its proper layer:

| Layer | Component | Responsibility |
|-------|-----------|----------------|
| Storage | ChromaDB (local) | Vector storage, similarity search, metadata filtering |
| Indexing | Python chunking pipeline | Format-aware document splitting, embedding generation |
| Retrieval | MCP Server (Python) | Query interface for AI agents via MCP |
| AI Gateway | Kong AI Gateway | Route embedding + LLM calls, cost tracking, prompt decoration, guardrails |
| Inference | OpenAI / Anthropic / Ollama | Embedding models + LLM reasoning |
| Client | Roo Code / VS Code | AI assistant consuming the MCP tool |

Architecture

System Context (C4 Level 1)

                    +------------------+
                    |    Architect     |
                    |  (VS Code User)  |
                    +--------+---------+
                             |
                    +--------v---------+
                    |    Roo Code /    |
                    |  Continue.dev /  |
                    |     Copilot      |
                    +--------+---------+
                             | MCP Protocol
                    +--------v---------+
                    | MCP Vector Server|
                    |  (Python, local) |
                    +---+----+----+----+
                        |    |    |
            +-----------+    |    +------------+
            |                |                 |
   +--------v-------+  +----v--------+  +-----v----------+
   |   ChromaDB     |  | Kong AI GW  |  |  File Watcher  |
   | (local vector  |  | (API proxy) |  |  (fswatch /    |
   |   database)    |  +----+--------+  |   watchdog)    |
   +----------------+       |           +----------------+
                        +---+---+
                        |       |
               +--------v-+  +--v-----------+
               | Embedding|  | LLM Inference|
               | Provider |  | Provider     |
               | (OpenAI/ |  | (Anthropic/  |
               |  Ollama) |  |  OpenAI)     |
               +----------+  +--------------+

Data Flow

Indexing Flow (background, on file change):

File saved in workspace
  -> File watcher detects change
  -> Chunking pipeline splits file (format-aware)
  -> Chunks sent to Kong AI Gateway /ai/embeddings route
  -> Kong routes to embedding provider (OpenAI or Ollama)
  -> Kong logs: model, tokens, cost, latency
  -> Embedding vectors returned
  -> ChromaDB upserts vectors with metadata (file path, line range, content type)

Query Flow (on-demand, during AI agent reasoning):

AI agent calls MCP tool: search("which services call svc-guest-profiles?")
  -> MCP server embeds query via Kong AI Gateway /ai/embeddings
  -> MCP server queries ChromaDB for top-k similar chunks
  -> MCP server returns ranked results with file paths + content snippets
  -> AI agent uses retrieved context in its reasoning
  -> AI agent's LLM call routes through Kong AI Gateway /chat/completions
  -> Kong logs: full request cost, latency, model, token counts

Kong AI Gateway's Role (Specifically)

Kong AI does not perform retrieval. It provides five critical infrastructure services for this pipeline:

| Service | How Kong AI Delivers It |
|---------|-------------------------|
| Unified embedding API | Kong's AI Proxy plugin exposes a single embeddings endpoint. The backend can be switched between OpenAI, Cohere, or local Ollama without changing any client code |
| Cost tracking | Every embedding call and every LLM inference call passes through Kong. The AI Observability plugin logs token counts, model name, latency, and estimated cost per request. This yields cost-per-index-run and cost-per-query estimates |
| Prompt decoration | Kong's AI Prompt Decorator plugin can inject a system prompt prefix into every LLM call (e.g., "You have access to a workspace vector search tool. Use it before reading files manually."). This steers agent behavior without modifying the AI assistant's configuration |
| Rate limiting | Kong's Rate Limiting plugin prevents runaway re-indexing from consuming excessive embedding API quota (e.g., max 1,000 embedding requests per minute) |
| Multi-provider failover | If OpenAI's embedding endpoint is down, Kong automatically routes to the fallback provider (Cohere or local Ollama) with no client-side changes |
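
The declarative config in Component 3 defines one route per provider, so until gateway-level failover is configured, the fallback can also be approximated in the client. The sketch below is a minimal illustration under that assumption: `ROUTES`, `embed_with_fallback`, and the injected `send` callable are hypothetical names, and the route paths and model names are copied from the config and embedding client later in this plan.

```python
from typing import Callable, Sequence

# Hypothetical route table; paths and models mirror kong.yml and embedder.py.
ROUTES = {
    "openai": ("/ai/embeddings", "text-embedding-3-small"),
    "local": ("/ai/embeddings/local", "nomic-embed-text"),
}

# send(path, payload) performs the HTTP POST and returns the embedding vectors.
Sender = Callable[[str, dict], list]


def embed_with_fallback(
    texts: Sequence[str],
    send: Sender,
    order: Sequence[str] = ("openai", "local"),
) -> tuple:
    """Try each provider route in order; return (provider, vectors)."""
    errors = []
    for provider in order:
        path, model = ROUTES[provider]
        try:
            vectors = send(path, {"model": model, "input": list(texts)})
            return provider, vectors
        except Exception as exc:  # real code should catch the HTTP client's errors
            errors.append(f"{provider}: {exc}")
    raise RuntimeError("all embedding providers failed: " + "; ".join(errors))
```

The function returns the provider name along with the vectors because embeddings from different models are not comparable. Vectors produced by the fallback model would need their own collection, or a re-index once the primary provider is back.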

Component Design

Component 1: Chunking Pipeline

Purpose: Split workspace files into semantically meaningful chunks suitable for embedding.

Location: scripts/vector-db/chunker.py

Format-aware chunking rules:

| File Type | Chunking Strategy | Expected Chunk Size |
|-----------|-------------------|---------------------|
| Markdown (.md) | Split by H2 (##) headers. Each section becomes one chunk. Front matter (title, metadata table) stays attached to the first chunk | 200-1500 tokens |
| YAML (OpenAPI specs) | Split by path + operation. Each `paths.<path>.<method>` block becomes one chunk. info and components/schemas are separate chunks | 100-800 tokens |
| YAML (metadata files) | Split by top-level key. Each capability, ticket, or event definition becomes one chunk | 50-400 tokens |
| AsyncAPI (.yaml) | Split by channel. Each channel + message schema becomes one chunk | 100-500 tokens |
| Java (.java) | Split by class method. Each method (with its Javadoc) becomes one chunk. Class-level annotations and imports stay with the first chunk | 100-1000 tokens |
| PlantUML (.puml) | Entire file as one chunk (these are small) | 50-200 tokens |
| ADR (.md) | Split by MADR section (Context, Decision Drivers, Options, Outcome, Consequences) | 100-500 tokens |
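
The "split by top-level key" rule for metadata YAML can be sketched with a line-based splitter, which has the side benefit of producing real line ranges for the chunk metadata. This is an illustration only: `KeyChunk` and `split_top_level_keys` are hypothetical names, and the sketch assumes top-level keys start in column 0 and that nothing else in the file does (comments and `---` separators are skipped).

```python
import re
from typing import Iterator, NamedTuple


class KeyChunk(NamedTuple):
    """Hypothetical helper type; not part of chunker.py."""
    key: str
    line_start: int
    line_end: int
    content: str


# A top-level mapping key: starts in column 0 and is followed by ':'.
TOP_LEVEL_KEY = re.compile(r'^([A-Za-z0-9_][^:#]*):(\s|$)')


def split_top_level_keys(text: str) -> Iterator[KeyChunk]:
    """Yield one chunk per top-level YAML key, with 1-based line ranges."""
    lines = text.split('\n')
    starts = [
        (i, m.group(1).strip())
        for i, line in enumerate(lines, 1)
        if (m := TOP_LEVEL_KEY.match(line))
    ]
    for idx, (start, key) in enumerate(starts):
        # A chunk runs until the line before the next top-level key
        end = starts[idx + 1][0] - 1 if idx + 1 < len(starts) else len(lines)
        # Trim trailing blank lines so the range points at real content
        while end > start and not lines[end - 1].strip():
            end -= 1
        yield KeyChunk(key, start, end, '\n'.join(lines[start - 1:end]))
```

Anything before the first key, such as a leading comment block, is not emitted. A file where each capability or ticket sits one level deeper would need the same idea applied at that indentation level.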

Metadata per chunk:

{
    "file_path": "architecture/specs/svc-check-in.yaml",
    "file_type": "openapi",
    "chunk_type": "endpoint",           # endpoint | schema | section | method | definition
    "section_heading": "POST /check-ins",
    "line_start": 45,
    "line_end": 98,
    "service": "svc-check-in",          # extracted from path or spec content
    "domain": "Operations",             # from DOMAINS mapping
    "last_modified": "2026-03-12T14:30:00Z"
}

Implementation:

# scripts/vector-db/chunker.py

import os
import yaml
import re
from dataclasses import dataclass
from typing import Generator
from pathlib import Path

@dataclass
class Chunk:
    content: str
    file_path: str
    file_type: str
    chunk_type: str
    section_heading: str
    line_start: int
    line_end: int
    metadata: dict

SKIP_DIRS = {'.git', 'node_modules', '.venv', 'site', '__pycache__', '.mypy_cache'}
SKIP_FILES = {'.DS_Store', '.gitignore', '.env'}

def chunk_markdown(file_path: str, content: str) -> Generator[Chunk, None, None]:
    """Split Markdown by H2 headers."""
    lines = content.split('\n')
    current_section = []
    current_heading = "Preamble"
    section_start = 1

    for i, line in enumerate(lines, 1):
        if line.startswith('## '):
            # Flush the previous section unless it is blank (e.g. the file
            # opens with an H2, so there is no preamble to emit)
            if any(prev.strip() for prev in current_section):
                yield Chunk(
                    content='\n'.join(current_section),
                    file_path=file_path,
                    file_type='markdown',
                    chunk_type='section',
                    section_heading=current_heading,
                    line_start=section_start,
                    line_end=i - 1,
                    metadata={}
                )
            current_section = [line]
            current_heading = line[3:].strip()
            section_start = i
        else:
            current_section.append(line)

    if current_section:
        yield Chunk(
            content='\n'.join(current_section),
            file_path=file_path,
            file_type='markdown',
            chunk_type='section',
            section_heading=current_heading,
            line_start=section_start,
            line_end=len(lines),
            metadata={}
        )

def chunk_openapi(file_path: str, content: str) -> Generator[Chunk, None, None]:
    """Split OpenAPI YAML by path+operation."""
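    try:
        spec = yaml.safe_load(content)
    except yaml.YAMLError:
        spec = None
    if not isinstance(spec, dict):
        # Unparseable or non-mapping YAML: fall back to one whole-file chunk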
        yield Chunk(
            content=content,
            file_path=file_path,
            file_type='openapi',
            chunk_type='full_file',
            section_heading=os.path.basename(file_path),
            line_start=1,
            line_end=content.count('\n') + 1,
            metadata={}
        )
        return

    # Info block
    if 'info' in spec:
        info_yaml = yaml.dump({'info': spec['info']}, default_flow_style=False)
        yield Chunk(
            content=info_yaml,
            file_path=file_path,
            file_type='openapi',
            chunk_type='info',
            section_heading=spec.get('info', {}).get('title', 'API Info'),
            line_start=1,
            line_end=1,
            metadata={'service': _extract_service(file_path)}
        )

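    # Each path+operation. A path item can also hold non-operation keys
    # (parameters, summary, servers, $ref), so keep only HTTP methods.
    http_methods = {'get', 'put', 'post', 'delete', 'options', 'head', 'patch', 'trace'}
    for path, methods in (spec.get('paths') or {}).items():
        for method, operation in (methods or {}).items():
            if method.lower() not in http_methods or not isinstance(operation, dict):
                continue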
            op_yaml = yaml.dump(
                {path: {method: operation}},
                default_flow_style=False
            )
            summary = operation.get('summary', f'{method.upper()} {path}')
            yield Chunk(
                content=op_yaml,
                file_path=file_path,
                file_type='openapi',
                chunk_type='endpoint',
                section_heading=f'{method.upper()} {path} -- {summary}',
                line_start=1,
                line_end=1,
                metadata={'service': _extract_service(file_path)}
            )

    # Schemas
    schemas = (spec.get('components') or {}).get('schemas') or {}
    for name, schema in schemas.items():
        schema_yaml = yaml.dump({name: schema}, default_flow_style=False)
        yield Chunk(
            content=schema_yaml,
            file_path=file_path,
            file_type='openapi',
            chunk_type='schema',
            section_heading=f'Schema: {name}',
            line_start=1,
            line_end=1,
            metadata={'service': _extract_service(file_path)}
        )

def chunk_java(file_path: str, content: str) -> Generator[Chunk, None, None]:
    """Split Java by method boundaries."""
    # Simplified: split on method-level patterns
    lines = content.split('\n')
    method_pattern = re.compile(
        r'^\s+(public|private|protected)\s+\S+\s+\w+\s*\('
    )
    current_block = []
    block_start = 1
    current_heading = os.path.basename(file_path)

    for i, line in enumerate(lines, 1):
        if method_pattern.match(line) and current_block:
            yield Chunk(
                content='\n'.join(current_block),
                file_path=file_path,
                file_type='java',
                chunk_type='method',
                section_heading=current_heading,
                line_start=block_start,
                line_end=i - 1,
                metadata={}
            )
            current_block = [line]
            current_heading = line.strip()
            block_start = i
        else:
            current_block.append(line)

    if current_block:
        yield Chunk(
            content='\n'.join(current_block),
            file_path=file_path,
            file_type='java',
            chunk_type='method',
            section_heading=current_heading,
            line_start=block_start,
            line_end=len(lines),
            metadata={}
        )

def _extract_service(file_path: str) -> str:
    """Extract service name from file path."""
    parts = Path(file_path).parts
    for part in parts:
        if part.startswith('svc-'):
            return part
    stem = Path(file_path).stem
    if stem.startswith('svc-'):
        return stem
    return ''

def chunk_workspace(workspace_root: str) -> Generator[Chunk, None, None]:
    """Walk workspace and yield all chunks."""
    for dirpath, dirnames, filenames in os.walk(workspace_root):
        dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]

        for filename in filenames:
            if filename in SKIP_FILES:
                continue

            file_path = os.path.join(dirpath, filename)
            rel_path = os.path.relpath(file_path, workspace_root)

            try:
                with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
                    content = f.read()
            except (PermissionError, IsADirectoryError):
                continue

            if not content.strip():
                continue

            ext = Path(filename).suffix.lower()

            if ext == '.md':
                yield from chunk_markdown(rel_path, content)
            elif ext in ('.yaml', '.yml'):
                # Detect OpenAPI vs plain YAML
                if 'openapi:' in content[:500]:
                    yield from chunk_openapi(rel_path, content)
                else:
                    # Plain YAML: single chunk
                    yield Chunk(
                        content=content,
                        file_path=rel_path,
                        file_type='yaml',
                        chunk_type='full_file',
                        section_heading=filename,
                        line_start=1,
                        line_end=content.count('\n') + 1,
                        metadata={}
                    )
            elif ext == '.java':
                yield from chunk_java(rel_path, content)
            elif ext in ('.puml', '.plantuml'):
                yield Chunk(
                    content=content,
                    file_path=rel_path,
                    file_type='plantuml',
                    chunk_type='full_file',
                    section_heading=filename,
                    line_start=1,
                    line_end=content.count('\n') + 1,
                    metadata={}
                )

Component 2: ChromaDB Vector Store

Purpose: Store embeddings locally with metadata filtering and similarity search.

Location: scripts/vector-db/store.py

Why ChromaDB:

| Criterion | ChromaDB | LanceDB | Qdrant | FAISS |
|-----------|----------|---------|--------|-------|
| Local-first (no server needed) | Yes (persistent mode) | Yes | Needs Docker | Yes |
| Metadata filtering | Yes | Yes | Yes | No |
| Python SDK quality | Excellent | Good | Good | Minimal |
| Incremental upsert | Yes (by ID) | Yes | Yes | Manual |
| Hybrid search (vector + keyword) | Yes (with where_document) | No | Yes | No |
| Disk footprint | <100 MB for this workspace | <50 MB | ~200 MB (Docker) | <50 MB |
| Setup complexity | pip install chromadb | pip install lancedb | Docker container | pip install faiss-cpu |

Implementation:

# scripts/vector-db/store.py

import chromadb
import hashlib
from pathlib import Path

DB_PATH = ".vector-db"
COLLECTION_NAME = "novatrek-workspace"

def get_collection():
    client = chromadb.PersistentClient(path=DB_PATH)
    return client.get_or_create_collection(
        name=COLLECTION_NAME,
        metadata={"hnsw:space": "cosine"}
    )

def chunk_id(file_path: str, line_start: int, section_heading: str) -> str:
    """Deterministic ID for upsert idempotency."""
    raw = f"{file_path}:{line_start}:{section_heading}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]

def upsert_chunks(chunks, embeddings):
    """Upsert chunk embeddings into ChromaDB."""
    collection = get_collection()
    ids = []
    documents = []
    metadatas = []

    for chunk in chunks:
        ids.append(chunk_id(chunk.file_path, chunk.line_start, chunk.section_heading))
        documents.append(chunk.content)
        metadatas.append({
            "file_path": chunk.file_path,
            "file_type": chunk.file_type,
            "chunk_type": chunk.chunk_type,
            "section_heading": chunk.section_heading,
            "line_start": chunk.line_start,
            "line_end": chunk.line_end,
            **chunk.metadata
        })

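    # Callers are expected to pass modest batches (index.py sends 50 chunks
    # per call); a single very large upsert can exceed ChromaDB's batch limit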
    collection.upsert(
        ids=ids,
        embeddings=embeddings,
        documents=documents,
        metadatas=metadatas
    )

def query(embedding, top_k=5, where_filter=None):
    """Query ChromaDB for similar chunks."""
    collection = get_collection()
    kwargs = {
        "query_embeddings": [embedding],
        "n_results": top_k,
        "include": ["documents", "metadatas", "distances"]
    }
    if where_filter:
        kwargs["where"] = where_filter

    return collection.query(**kwargs)

def delete_by_file(file_path: str):
    """Remove all chunks for a file (before re-indexing)."""
    collection = get_collection()
    collection.delete(where={"file_path": file_path})

def get_stats():
    """Return collection statistics."""
    collection = get_collection()
    return {
        "total_chunks": collection.count(),
        "collection": COLLECTION_NAME
    }
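
Because `line_start` is part of the chunk ID, an edit that shifts a section down the file gives that section a new ID. The short demonstration below repeats the `chunk_id` scheme from store.py to show why an upsert alone is not enough after an edit and why old chunks for a file are removed with `delete_by_file` before re-indexing. The file path is a made-up example.

```python
import hashlib


def chunk_id(file_path: str, line_start: int, section_heading: str) -> str:
    """Same scheme as store.py: first 16 hex chars of a SHA-256 digest."""
    raw = f"{file_path}:{line_start}:{section_heading}"
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


# Hypothetical ADR path, used only for illustration
before = chunk_id("docs/adr/ADR-001.md", 10, "Decision Drivers")
same = chunk_id("docs/adr/ADR-001.md", 10, "Decision Drivers")
shifted = chunk_id("docs/adr/ADR-001.md", 12, "Decision Drivers")  # two lines added above

print(before == same)     # True: unchanged content is overwritten in place
print(before == shifted)  # False: the shifted section gets a new ID
```

By the same reasoning, a full re-index that only upserts may leave entries behind for sections that moved or were deleted, unless the collection is cleared first.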

Component 3: Kong AI Gateway (Docker)

Purpose: Central proxy for all embedding and LLM API calls. Provides cost tracking, rate limiting, prompt decoration, and multi-provider failover.

Location: Added to docker-compose.yml

Docker Compose addition:

  # ---------------------------------------------------------------------------
  # Kong AI Gateway (manages embedding + LLM API traffic)
  # ---------------------------------------------------------------------------
  kong:
    image: kong/kong-gateway:3.9
    container_name: novatrek-kong-ai
    environment:
      KONG_DATABASE: "off"
      KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
      KONG_PROXY_LISTEN: "0.0.0.0:8000"
      KONG_ADMIN_LISTEN: "0.0.0.0:8001"
      KONG_LOG_LEVEL: info
    ports:
      - "8000:8000"   # Proxy (AI API calls go here)
      - "8001:8001"   # Admin API
    volumes:
      - ./config/kong/kong.yml:/etc/kong/kong.yml:ro
    healthcheck:
      test: ["CMD", "kong", "health"]
      interval: 10s
      timeout: 5s
      retries: 5

Kong declarative config (config/kong/kong.yml):

_format_version: "3.0"

services:
  # ===== Embedding Provider: OpenAI =====
  - name: openai-embeddings
    url: https://api.openai.com/v1
    routes:
      - name: embeddings-route
        paths:
          - /ai/embeddings
        strip_path: true
    plugins:
      - name: ai-proxy
        config:
          route_type: llm/v1/embeddings
          auth:
            header_name: Authorization
            header_value: "Bearer ${OPENAI_API_KEY}"
          model:
            provider: openai
            name: text-embedding-3-small
      - name: rate-limiting
        config:
          minute: 500
          policy: local
      # No ai-prompt-decorator on this service: the plugin decorates chat
      # messages, and embedding requests carry no messages to prepend to.

  # ===== Embedding Provider: Ollama (local fallback) =====
  - name: ollama-embeddings
    url: http://host.docker.internal:11434/v1
    routes:
      - name: embeddings-local-route
        paths:
          - /ai/embeddings/local
        strip_path: true
    plugins:
      - name: rate-limiting
        config:
          minute: 1000
          policy: local

  # ===== LLM Inference: Anthropic (for Roo Code RAG-augmented calls) =====
  - name: anthropic-chat
    url: https://api.anthropic.com/v1
    routes:
      - name: chat-route
        paths:
          - /ai/chat
        strip_path: true
    plugins:
      - name: ai-proxy
        config:
          route_type: llm/v1/chat
          auth:
            header_name: x-api-key
            header_value: "${ANTHROPIC_API_KEY}"
          model:
            provider: anthropic
            name: claude-sonnet-4-20250514
      - name: rate-limiting
        config:
          minute: 100
          policy: local
      - name: ai-prompt-decorator
        config:
          prepend:
            - role: system
              content: >
                You have access to a workspace vector search tool via MCP.
                When investigating architecture questions, query the vector
                database before reading files directly. The workspace contains
                19 microservice OpenAPI specs, 11 ADRs, event schemas, and
                solution designs for the NovaTrek Adventures platform.

  # ===== LLM Inference: OpenAI (fallback) =====
  - name: openai-chat
    url: https://api.openai.com/v1
    routes:
      - name: chat-fallback-route
        paths:
          - /ai/chat/openai
        strip_path: true
    plugins:
      - name: ai-proxy
        config:
          route_type: llm/v1/chat
          auth:
            header_name: Authorization
            header_value: "Bearer ${OPENAI_API_KEY}"
          model:
            provider: openai
            name: gpt-4.1
      - name: rate-limiting
        config:
          minute: 100
          policy: local

What Kong AI tracks for every request:

{
  "request.model": "text-embedding-3-small",
  "request.provider": "openai",
  "response.tokens.input": 1523,
  "response.tokens.output": 0,
  "response.latency_ms": 142,
  "response.cost_usd": 0.0000305,
  "consumer": "vector-indexer",
  "route": "embeddings-route",
  "timestamp": "2026-03-14T15:30:42Z"
}
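
The `response.cost_usd` value in the record above is consistent with simple token arithmetic: 1,523 input tokens at $0.02 per million tokens is about $0.0000305. The sketch below aggregates such records per route. It assumes the field names shown above and a hard-coded price table (`PRICE_PER_MILLION` is a hypothetical constant; check the provider's current pricing before relying on it).

```python
from collections import defaultdict

# Assumed list price in USD per 1M input tokens; verify against the provider.
PRICE_PER_MILLION = {"text-embedding-3-small": 0.02}


def estimate_cost_usd(model: str, input_tokens: int) -> float:
    """Estimate the cost of one embedding request from its token count."""
    return input_tokens * PRICE_PER_MILLION[model] / 1_000_000


def cost_by_route(records: list) -> dict:
    """Sum estimated cost per Kong route over a list of log records."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec["route"]] += estimate_cost_usd(
            rec["request.model"], rec["response.tokens.input"]
        )
    return dict(totals)


record = {
    "request.model": "text-embedding-3-small",
    "response.tokens.input": 1523,
    "route": "embeddings-route",
}
print(cost_by_route([record]))
```

Summing over all records from one run of index.py gives the cost-per-index-run figure. Filtering on the chat route gives cost per query, once output-token prices are added to the table.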

Component 4: Embedding Client (via Kong AI)

Purpose: Generate embeddings by calling Kong AI Gateway's unified /ai/embeddings endpoint.

Location: scripts/vector-db/embedder.py

# scripts/vector-db/embedder.py

import os
import requests
from typing import Optional

KONG_BASE_URL = os.environ.get("KONG_AI_URL", "http://localhost:8000")

def embed_texts(texts: list[str], provider: str = "openai") -> list[list[float]]:
    """Generate embeddings via Kong AI Gateway."""
    if provider == "local":
        url = f"{KONG_BASE_URL}/ai/embeddings/local"
        payload = {
            "model": "nomic-embed-text",
            "input": texts
        }
    else:
        url = f"{KONG_BASE_URL}/ai/embeddings"
        payload = {
            "model": "text-embedding-3-small",
            "input": texts
        }

    response = requests.post(url, json=payload, timeout=30)
    response.raise_for_status()

    data = response.json()
    return [item["embedding"] for item in data["data"]]

def embed_query(query: str, provider: str = "openai") -> list[float]:
    """Embed a single query string."""
    return embed_texts([query], provider=provider)[0]
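
Since the embedding routes sit behind Kong's rate limiter, a burst of re-indexing can be answered with HTTP 429. A retry wrapper with exponential backoff is one way to absorb that. The sketch below is illustrative and not part of embedder.py: `RateLimited` and `call_with_backoff` are hypothetical names, and the caller is assumed to raise `RateLimited` when it sees a 429 response, passing along the `Retry-After` header value if one is present.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class RateLimited(Exception):
    """Hypothetical error raised by the caller on an HTTP 429 response."""

    def __init__(self, retry_after: Optional[float] = None):
        super().__init__("rate limited")
        self.retry_after = retry_after


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying on RateLimited with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Prefer the server's hint; otherwise double the wait each time
            if exc.retry_after is not None:
                delay = exc.retry_after
            else:
                delay = base_delay * 2 ** attempt
            sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

In `embed_texts`, the `requests.post` call could be wrapped in a small function that raises `RateLimited` when `response.status_code == 429` and is then passed to `call_with_backoff`.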

Component 5: MCP Server

Purpose: Expose vector search as an MCP tool that Roo Code (and other MCP-compatible clients) can call autonomously.

Location: scripts/vector-db/mcp_server.py

MCP Tool Definition:

{
  "name": "workspace_search",
  "description": "Semantic search across the entire NovaTrek architecture workspace. Searches OpenAPI specs, ADRs, solution designs, event schemas, capability metadata, and Java source code. Returns the top-k most relevant chunks with file paths and line numbers. Use this BEFORE reading files to find relevant context efficiently.",
  "inputSchema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Natural language search query"
      },
      "top_k": {
        "type": "integer",
        "default": 5,
        "description": "Number of results to return (1-20)"
      },
      "file_type": {
        "type": "string",
        "enum": ["markdown", "openapi", "yaml", "java", "plantuml"],
        "description": "Optional filter to restrict search to a specific file type"
      },
      "service": {
        "type": "string",
        "description": "Optional filter to restrict search to a specific service (e.g., 'svc-check-in')"
      }
    },
    "required": ["query"]
  }
}

Implementation:

# scripts/vector-db/mcp_server.py

import asyncio
import json
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent

from store import query as vector_query, get_stats
from embedder import embed_query

app = Server("novatrek-workspace-search")

@app.list_tools()
async def list_tools():
    return [
        Tool(
            name="workspace_search",
            description=(
                "Semantic search across the entire NovaTrek architecture workspace. "
                "Searches OpenAPI specs, ADRs, solution designs, event schemas, "
                "capability metadata, and Java source code. Returns the top-k most "
                "relevant chunks with file paths and line numbers. "
                "Use this BEFORE reading files to find relevant context efficiently."
            ),
            inputSchema={
                "type": "object",
                "properties": {
                    "query": {
                        "type": "string",
                        "description": "Natural language search query"
                    },
                    "top_k": {
                        "type": "integer",
                        "default": 5,
                        "description": "Number of results to return (1-20)"
                    },
                    "file_type": {
                        "type": "string",
                        "enum": ["markdown", "openapi", "yaml", "java", "plantuml"],
                        "description": "Optional: restrict to file type"
                    },
                    "service": {
                        "type": "string",
                        "description": "Optional: restrict to service (e.g., svc-check-in)"
                    }
                },
                "required": ["query"]
            }
        ),
        Tool(
            name="workspace_index_stats",
            description="Get statistics about the workspace vector index",
            inputSchema={
                "type": "object",
                "properties": {}
            }
        )
    ]

@app.call_tool()
async def call_tool(name: str, arguments: dict):
    if name == "workspace_search":
        query_text = arguments["query"]
        top_k = max(1, min(int(arguments.get("top_k", 5)), 20))  # clamp to 1-20

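        # Build metadata filter. ChromaDB accepts a single condition as a plain
        # dict but needs an explicit $and when several conditions are combined.
        conditions = [
            {key: arguments[key]}
            for key in ("file_type", "service")
            if arguments.get(key)
        ]
        if len(conditions) > 1:
            where_filter = {"$and": conditions}
        else:
            where_filter = conditions[0] if conditions else None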

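        # Embed query via Kong AI Gateway. embed_query() makes a blocking
        # HTTP call, so run it in a worker thread to keep the event loop free.
        query_embedding = await asyncio.to_thread(embed_query, query_text)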

        # Search ChromaDB
        results = vector_query(
            embedding=query_embedding,
            top_k=top_k,
            where_filter=where_filter if where_filter else None
        )

        # Format results
        output_lines = [f"## Search Results for: \"{query_text}\"\n"]
        for i, (doc, meta, dist) in enumerate(zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )):
            score = 1 - dist  # cosine distance to similarity
            output_lines.append(
                f"### Result {i+1} (similarity: {score:.3f})\n"
                f"**File:** {meta['file_path']} "
                f"(lines {meta['line_start']}-{meta['line_end']})\n"
                f"**Type:** {meta['file_type']} / {meta['chunk_type']}\n"
                f"**Section:** {meta['section_heading']}\n\n"
                f"```\n{doc[:500]}{'...' if len(doc) > 500 else ''}\n```\n"
            )

        return [TextContent(type="text", text='\n'.join(output_lines))]

    elif name == "workspace_index_stats":
        stats = get_stats()
        return [TextContent(
            type="text",
            text=json.dumps(stats, indent=2)
        )]

    return [TextContent(type="text", text=f"Unknown tool: {name}")]

async def main():
    async with stdio_server() as (read_stream, write_stream):
        await app.run(read_stream, write_stream)

if __name__ == "__main__":
    asyncio.run(main())

Component 6: File Watcher (Incremental Re-indexing)

Purpose: Detect file changes and re-index only the modified files.

Location: scripts/vector-db/watcher.py

# scripts/vector-db/watcher.py

import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from pathlib import Path

from chunker import Chunk, chunk_markdown, chunk_openapi, chunk_java, SKIP_DIRS
from embedder import embed_texts
from store import upsert_chunks, delete_by_file

WATCH_EXTENSIONS = {'.md', '.yaml', '.yml', '.java', '.puml', '.plantuml'}

class WorkspaceHandler(FileSystemEventHandler):
    def __init__(self, workspace_root: str):
        self.workspace_root = workspace_root

    def on_modified(self, event):
        if event.is_directory:
            return
        self._reindex(event.src_path)

    def on_created(self, event):
        if event.is_directory:
            return
        self._reindex(event.src_path)

    def on_deleted(self, event):
        if event.is_directory:
            return
        rel_path = str(Path(event.src_path).relative_to(self.workspace_root))
        delete_by_file(rel_path)

    def _reindex(self, abs_path: str):
        path = Path(abs_path)
        suffix = path.suffix.lower()
        if suffix not in WATCH_EXTENSIONS:
            return
        if any(skip in path.parts for skip in SKIP_DIRS):
            return

        rel_path = str(path.relative_to(self.workspace_root))

        try:
            content = path.read_text(encoding='utf-8', errors='ignore')
        except (PermissionError, FileNotFoundError):
            return

        if not content.strip():
            # File was emptied: drop its old chunks so they do not go stale
            delete_by_file(rel_path)
            return

        # Re-chunk with the same dispatch as chunk_workspace()
        if suffix == '.md':
            chunks = list(chunk_markdown(rel_path, content))
        elif suffix in ('.yaml', '.yml') and 'openapi:' in content[:500]:
            chunks = list(chunk_openapi(rel_path, content))
        elif suffix == '.java':
            chunks = list(chunk_java(rel_path, content))
        else:
            # Plain YAML and PlantUML are indexed as one whole-file chunk
            chunks = [Chunk(
                content=content,
                file_path=rel_path,
                file_type='yaml' if suffix in ('.yaml', '.yml') else 'plantuml',
                chunk_type='full_file',
                section_heading=path.name,
                line_start=1,
                line_end=content.count('\n') + 1,
                metadata={}
            )]

        if not chunks:
            return

        # Embed via Kong AI first, so a failed call leaves the old chunks intact
        texts = [c.content for c in chunks]
        embeddings = embed_texts(texts)

        # Replace the file's old chunks with the new ones
        delete_by_file(rel_path)
        upsert_chunks(chunks, embeddings)
        print(f"Re-indexed {rel_path}: {len(chunks)} chunks")

def watch(workspace_root: str):
    handler = WorkspaceHandler(workspace_root)
    observer = Observer()
    observer.schedule(handler, workspace_root, recursive=True)
    observer.start()
    print(f"Watching {workspace_root} for changes...")
    try:
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        observer.stop()
    observer.join()
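
Editors often emit several modify events for a single save, and each event above triggers a full re-embed of the file. A small debouncer can collapse those bursts. The sketch below is an illustration, not part of watcher.py: `Debouncer` is a hypothetical helper, and the half-second quiet period is an assumed default rather than a measured value.

```python
import threading
import time
from typing import Callable


class Debouncer:
    """Collect changed paths; release each only after a quiet period."""

    def __init__(
        self,
        quiet_seconds: float = 0.5,
        clock: Callable[[], float] = time.monotonic,
    ):
        self.quiet_seconds = quiet_seconds
        self.clock = clock
        self._last_seen = {}
        # Events arrive on the observer thread; due() runs on the main thread
        self._lock = threading.Lock()

    def touch(self, path: str) -> None:
        """Record an event for path (call from on_modified / on_created)."""
        with self._lock:
            self._last_seen[path] = self.clock()

    def due(self) -> list:
        """Return and forget paths that have been quiet long enough."""
        now = self.clock()
        with self._lock:
            ready = sorted(
                p for p, seen in self._last_seen.items()
                if now - seen >= self.quiet_seconds
            )
            for p in ready:
                del self._last_seen[p]
        return ready
```

With this in place, the event handlers would call `touch(event.src_path)` instead of re-indexing directly, and the polling loop in `watch()` would re-index whatever `due()` returns on each pass.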

Component 7: Index Runner (Full Re-index)

Purpose: One-shot full workspace indexing.

Location: scripts/vector-db/index.py

# scripts/vector-db/index.py

import sys
import time
from chunker import chunk_workspace
from embedder import embed_texts
from store import upsert_chunks, get_stats

BATCH_SIZE = 50  # Chunks per embedding API call

def index_workspace(workspace_root: str):
    print(f"Indexing workspace: {workspace_root}")
    start = time.time()

    all_chunks = list(chunk_workspace(workspace_root))
    print(f"Chunked {len(all_chunks)} chunks from workspace")

    # Batch embed
    for i in range(0, len(all_chunks), BATCH_SIZE):
        batch = all_chunks[i:i + BATCH_SIZE]
        texts = [c.content for c in batch]
        embeddings = embed_texts(texts)
        upsert_chunks(batch, embeddings)
        print(f"  Indexed batch {i//BATCH_SIZE + 1}/{(len(all_chunks) + BATCH_SIZE - 1)//BATCH_SIZE}")

    elapsed = time.time() - start
    stats = get_stats()
    print(f"Done. {stats['total_chunks']} chunks indexed in {elapsed:.1f}s")

if __name__ == "__main__":
    root = sys.argv[1] if len(sys.argv) > 1 else "."
    index_workspace(root)

Roo Code MCP Configuration

Add to Roo Code's MCP settings (.roo/mcp.json or via Roo Code settings UI):

{
  "mcpServers": {
    "novatrek-workspace": {
      "command": "python3",
      "args": ["scripts/vector-db/mcp_server.py"],
      "env": {
        "KONG_AI_URL": "http://localhost:8000"
      }
    }
  }
}

Once configured, Roo Code will see workspace_search and workspace_index_stats as available tools and can call them autonomously during any task.


Directory Structure

scripts/vector-db/
├── README.md                 # Setup and usage instructions
├── requirements.txt          # Python dependencies
├── chunker.py                # Format-aware document chunking
├── store.py                  # ChromaDB vector storage (supports Qdrant backend)
├── embedder.py               # Embedding client (via Kong AI Gateway)
├── mcp_server.py             # MCP server for Roo Code integration
├── watcher.py                # File watcher for incremental re-indexing
├── reindex-file.py           # Single-file re-indexer (called by VS Code extension)
├── index.py                  # Full workspace indexer
└── test_chunker.py           # Unit tests for chunking logic

config/kong/
└── kong.yml                  # Kong AI Gateway declarative configuration

.vscode/
└── tasks.json                # Auto-start watcher on workspace open

.githooks/
├── post-merge                # Auto-reindex after git pull
└── post-checkout             # Auto-reindex after branch switch

.vector-db/                   # ChromaDB persistent storage (gitignored)

Dependencies

Python (scripts/vector-db/requirements.txt):

chromadb>=0.5,<1.0
mcp>=1.0,<2.0
watchdog>=4.0,<5.0
requests>=2.31,<3.0
pyyaml>=6.0,<7.0

Docker:

  • Kong Gateway 3.9+ (from kong/kong-gateway:3.9)
  • Ollama (optional, for local embeddings): docker run -d -p 11434:11434 ollama/ollama

API Keys (in .env):

export OPENAI_API_KEY=sk-...          # For text-embedding-3-small
export ANTHROPIC_API_KEY=sk-ant-...    # For Claude (if routing LLM calls through Kong)

Implementation Phases

Phase A: Foundation (Day 1-2)

| Step | Task | Validation |
| --- | --- | --- |
| A.1 | Create scripts/vector-db/ directory and requirements.txt | pip install -r requirements.txt succeeds |
| A.2 | Implement chunker.py with Markdown + YAML + Java splitters | Unit test: chunk a sample OpenAPI spec, verify endpoint-level splitting |
| A.3 | Implement store.py with ChromaDB persistent storage | Unit test: upsert 10 chunks, query by embedding, verify top-k results |
| A.4 | Implement embedder.py with direct OpenAI calls (no Kong yet) | Verify: embed a test string, get 1536-dim vector back |
| A.5 | Implement index.py full workspace indexer | Run against workspace, verify chunk count and ChromaDB stats |
| A.6 | Add .vector-db/ to .gitignore | Verify directory not tracked |

Milestone: Full workspace indexed into local ChromaDB. Can query from Python REPL.

Phase B: MCP Server (Day 2-3)

| Step | Task | Validation |
| --- | --- | --- |
| B.1 | Implement mcp_server.py with workspace_search and workspace_index_stats tools | MCP inspector tool: connect and list tools |
| B.2 | Configure Roo Code MCP connection (.roo/mcp.json) | Roo Code shows "novatrek-workspace" in MCP server list |
| B.3 | Test end-to-end: ask Roo Code "which services handle guest check-in?" and verify it calls workspace_search | Agent log shows MCP tool call + relevant results |
| B.4 | Tune top-k and chunk size based on retrieval quality | Manual review of 10 test queries |

Milestone: Roo Code can autonomously search the workspace vector DB during any task.

Phase C: Kong AI Gateway (Day 3-4)

| Step | Task | Validation |
| --- | --- | --- |
| C.1 | Add Kong to docker-compose.yml | docker compose up kong starts successfully |
| C.2 | Create config/kong/kong.yml with embedding routes | curl http://localhost:8001/services returns configured services |
| C.3 | Update embedder.py to route through Kong (http://localhost:8000/ai/embeddings) | Embeddings still work; Kong access log shows requests |
| C.4 | Add AI Prompt Decorator plugin for agent steering | LLM calls via Kong include injected system prompt |
| C.5 | Add Rate Limiting plugin | Verify 429 response when exceeding limit |
| C.6 | Add cost tracking (AI Observability or custom logging plugin) | Kong logs show token counts and estimated cost per request |
| C.7 | Configure Ollama as local fallback | When OpenAI key removed, embeddings still work via Ollama route |

Milestone: All AI API traffic flows through Kong with observability, cost tracking, and rate limiting.
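
Step C.3 routes embeddings through Kong, while the risk register notes that setting KONG_AI_URL="" lets embedder.py call OpenAI directly. The sketch below shows one way that switch could be resolved. The function name resolve_embedding_url is hypothetical, and the direct endpoint is the public OpenAI embeddings URL; the Kong path /ai/embeddings is taken from step C.3.

```python
import os

# Public OpenAI embeddings endpoint, used only when Kong is bypassed.
OPENAI_EMBEDDINGS_URL = "https://api.openai.com/v1/embeddings"


def resolve_embedding_url(env=None):
    """Return the URL an embedding request should be POSTed to.

    Hypothetical helper: a non-empty KONG_AI_URL routes through Kong,
    an empty or missing value falls back to OpenAI directly.
    """
    env = os.environ if env is None else env
    kong_url = env.get("KONG_AI_URL", "").strip()
    if kong_url:
        # Route from step C.3: <Kong proxy>/ai/embeddings
        return kong_url.rstrip("/") + "/ai/embeddings"
    return OPENAI_EMBEDDINGS_URL
```

Passing the environment as a parameter keeps the function easy to unit test without touching the real process environment.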

Phase D: File Watching + Polish (Day 4-5)

| Step | Task | Validation |
| --- | --- | --- |
| D.1 | Implement watcher.py with watchdog | Modify a YAML file, verify ChromaDB re-indexes within 5 seconds |
| D.2 | Add Makefile targets for common operations | make vector-index, make vector-watch, make vector-stats |
| D.3 | Write scripts/vector-db/README.md with setup and usage instructions | New team member can set up from scratch following README |
| D.4 | Add unit tests for chunker edge cases (empty files, malformed YAML, deeply nested Markdown) | All tests pass |
| D.5 | Performance test: time full re-index, measure query latency | Full index < 60s, query latency < 500ms |

Milestone: Production-ready system with automatic re-indexing and developer documentation.
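
The query-latency half of step D.5 could be measured with a small harness like the one below. This is a minimal sketch: measure_latency_ms is a hypothetical helper, and the search argument stands in for whatever callable performs the embed-query-plus-search round trip in the real system.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def measure_latency_ms(search: Callable[[str], object], queries: Iterable[str]) -> dict:
    """Time each call to search() and summarise the latencies in milliseconds."""
    samples = []
    for query in queries:
        start = time.perf_counter()
        search(query)
        samples.append((time.perf_counter() - start) * 1000.0)
    if not samples:
        raise ValueError("at least one query is required")
    samples.sort()
    # Nearest-rank 95th percentile over the sorted samples.
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }
```

Comparing p95_ms rather than the mean against the 500 ms target avoids a few fast queries hiding a slow tail.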

Phase E: Optimization (Day 5-6, optional)

| Step | Task | Validation |
| --- | --- | --- |
| E.1 | Add hybrid search (vector + BM25 keyword matching) | Structural queries ("services calling svc-check-in") return better results |
| E.2 | Add chunk overlap (50-token overlap between adjacent chunks) | Boundary-spanning concepts are not lost |
| E.3 | Add file-type boosting (weight OpenAPI specs higher for API queries) | API-related queries prioritize spec content |
| E.4 | Export Kong cost metrics to a dashboard | Weekly cost report for embedding + LLM calls |
| E.5 | Configure Continue.dev as alternative MCP client | Continue.dev can also query the same vector DB |
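
Step E.1 does not say how vector and BM25 results would be merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the plan's chosen method. The function takes already-ranked lists of chunk IDs (one from the vector search, one from a keyword search) and does not implement BM25 itself; k=60 is the conventional default constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of chunk IDs into one list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    so chunks ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID for a deterministic order.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


vector_hits = ["a", "b", "c"]    # hypothetical IDs from the vector search
keyword_hits = ["b", "d", "a"]   # hypothetical IDs from a BM25 search
merged = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only needs ranks, so it sidesteps the problem of putting cosine distances and BM25 scores on a common scale.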

Makefile Additions

# ===========================================================================
# Vector Database (Workspace Search)
# ===========================================================================

.PHONY: vector-index vector-watch vector-stats vector-search kong-up kong-logs kong-routes

vector-index: ## Full re-index of workspace into vector DB
	python3 scripts/vector-db/index.py .

vector-watch: ## Watch workspace and re-index on file changes
	python3 scripts/vector-db/watcher.py

# scripts/vector-db/ contains a hyphen and cannot be imported as a package,
# so put it on PYTHONPATH and import the modules directly (as index.py does).
vector-stats: ## Show vector DB statistics
	PYTHONPATH=scripts/vector-db python3 -c "from store import get_stats; import json; print(json.dumps(get_stats(), indent=2))"

vector-search: ## Search vector DB: make vector-search Q="your query"
	PYTHONPATH=scripts/vector-db python3 -c "from embedder import embed_query; from store import query; r=query(embed_query('$(Q)')); [print(f'{m[\"file_path\"]}:{m[\"line_start\"]} ({m[\"section_heading\"]})') for m in r['metadatas'][0]]"

kong-up: ## Start Kong AI Gateway
	docker compose up kong -d

kong-logs: ## Tail Kong AI Gateway logs
	docker compose logs kong -f

kong-routes: ## List Kong AI routes
	curl -s http://localhost:8001/routes | python3 -m json.tool

Cost Projections

Initial Full Index

| Metric | OpenAI Embeddings | Local Ollama |
| --- | --- | --- |
| Estimated chunks | ~3,000-5,000 | Same |
| Avg tokens per chunk | ~300 | Same |
| Total tokens | ~900,000-1,500,000 | Same |
| Embedding cost | $0.02-0.03 | $0.00 |
| Time (API) | 30-60 seconds | 2-5 minutes |

Daily Operations (Incremental)

| Metric | OpenAI Embeddings | Local Ollama |
| --- | --- | --- |
| Files modified per day | ~20-50 | Same |
| Chunks re-indexed | ~100-300 | Same |
| Daily embedding cost | < $0.002 | $0.00 |
| Query cost per search | ~$0.000002 | $0.00 |
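
The arithmetic behind the two tables above can be checked with a few lines of Python. The price of $0.02 per million tokens is an assumption, inferred from the upper bound of the initial-index estimate ($0.03 for 1,500,000 tokens); embedding_cost_usd is a hypothetical helper, not part of the plan's scripts.

```python
# ASSUMPTION: price implied by the tables ($0.03 for 1.5M tokens).
PRICE_PER_MILLION_TOKENS_USD = 0.02


def embedding_cost_usd(chunks, avg_tokens_per_chunk=300,
                       price_per_million=PRICE_PER_MILLION_TOKENS_USD):
    """Return (total_tokens, cost_in_usd) for embedding the given number of chunks."""
    total_tokens = chunks * avg_tokens_per_chunk
    return total_tokens, total_tokens / 1_000_000 * price_per_million


for label, chunks in [("full index (low)", 3_000), ("full index (high)", 5_000),
                      ("daily (low)", 100), ("daily (high)", 300)]:
    tokens, cost = embedding_cost_usd(chunks)
    print(f"{label:>18}: {tokens:>9,} tokens, ${cost:.4f}")
```

Re-running the script with a different price or average chunk size shows how sensitive the projections are to those two inputs.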

Kong AI Gateway Overhead

| Metric | Value |
| --- | --- |
| Docker image size | ~150 MB |
| Memory usage | ~128-256 MB |
| CPU overhead per request | < 1 ms (proxy latency) |
| Added latency per request | 2-5 ms |

Total Monthly Cost

| Configuration | Monthly Cost |
| --- | --- |
| OpenAI embeddings + Kong (local Docker) | ~$0.50 - $1.00 |
| Ollama local embeddings + Kong (local Docker) | $0.00 (compute only) |
| For comparison: GitHub Copilot (includes @workspace RAG) | $39.00/month |

Risk Register

| Risk | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- |
| ChromaDB data corruption on crash | Low | Medium | .vector-db/ is ephemeral -- full re-index recovers in < 60 seconds |
| Stale embeddings return wrong context | Medium | Medium | File watcher for auto-reindex; vector-index Makefile target for manual rebuild |
| Kong AI Gateway adds complexity for solo architect | Medium | Low | Kong is optional -- embedder.py can call OpenAI directly by setting KONG_AI_URL="" |
| Chunking splits critical context across boundaries | Medium | Medium | 50-token overlap in Phase E; tune chunk boundaries for domain-specific patterns |
| Roo Code ignores MCP tool (doesn't call workspace_search) | Low | High | MCP tool description explicitly instructs "use this BEFORE reading files"; add to Roo Code system prompt |
| OpenAI API rate limits during bulk re-index | Low | Low | Kong rate limiting prevents bursts; batch size of 50 stays well under limits |
| Embedding model version change alters vector space | Low | High | Re-index entire workspace when embedding model changes (< 60 seconds) |
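
The 50-token overlap named as a mitigation for boundary splits can be sketched as a sliding window. This is an illustration under stated assumptions: split_with_overlap is a hypothetical helper, tokens are any pre-tokenized sequence (a real implementation would use the embedding model's tokenizer), and the window would apply only inside sections the format-aware chunker considers too large.

```python
def split_with_overlap(tokens, chunk_size=300, overlap=50):
    """Split a token sequence into windows that share `overlap` tokens with their neighbour."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        # Stop once a window reaches the end, so no trailing chunk
        # consists purely of tokens already covered.
        if start + chunk_size >= len(tokens):
            break
    return chunks


# Small numbers make the overlap easy to see: each window repeats
# the last token of the previous one.
example = split_with_overlap(list(range(10)), chunk_size=4, overlap=1)
```

Larger overlaps reduce the chance of splitting a concept but increase the number of chunks, and therefore the embedding cost, roughly in proportion to chunk_size / (chunk_size - overlap).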

Multi-Architect Deployment

Per-Architect Resource Model

Every component in this plan runs locally. Each architect who opens the workspace gets their own independent instance:

| Component | Per-architect? | Why |
| --- | --- | --- |
| ChromaDB (.vector-db/) | Yes -- local disk | ChromaDB runs as an embedded library, not a server. Each architect's checkout has its own .vector-db/ directory (gitignored). No shared state |
| File watcher | Yes -- local process | Each architect runs make vector-watch in their VS Code terminal. It watches their working copy for changes |
| Kong AI Gateway | Yes or shared | If running locally via Docker Compose (make kong-up), each architect runs their own. Could be shared via a team-hosted instance |
| MCP server | Yes -- local process | Roo Code spawns the MCP server as a child process (configured in .roo/mcp.json). It runs per VS Code window |
| Embeddings | Shared API key | All architects hit the same OpenAI/Ollama endpoint (via Kong or directly). Cost is pooled |

Practical Workflow Per Architect

Architect opens VS Code
  -> Roo Code auto-starts MCP server (from .roo/mcp.json config)
  -> MCP server connects to local ChromaDB

First time (or after git pull with many changes):
  -> Run: make vector-index          # ~60 seconds, full re-index

Ongoing:
  -> Run: make vector-watch          # background, re-indexes on save
  -> (Or: VS Code task that auto-starts watcher on workspace open)

Scaling to Multiple Architects

The core limitation is that every architect maintains their own local vector DB. Approaches A and B reduce the manual work this creates; Approach C removes the local DB altogether:

Approach A: VS Code Task Auto-Start (Low Effort)

Add a .vscode/tasks.json task that auto-runs the watcher on workspace open. Each architect still has a local DB, but the watcher starts automatically with no manual step.

{
  "version": "2.0.0",
  "tasks": [
    {
      "label": "Vector DB: Watch for Changes",
      "type": "shell",
      "command": "python3",
      "args": ["scripts/vector-db/watcher.py"],
      "isBackground": true,
      "problemMatcher": [],
      "runOptions": {
        "runOn": "folderOpen"
      },
      "presentation": {
        "reveal": "silent",
        "panel": "dedicated"
      }
    },
    {
      "label": "Vector DB: Full Re-index",
      "type": "shell",
      "command": "python3",
      "args": ["scripts/vector-db/index.py", "."],
      "problemMatcher": [],
      "presentation": {
        "reveal": "always"
      }
    }
  ]
}

Effort: 30 minutes. Trade-off: Still per-architect, still needs initial index after clone.

Approach B: Git Hook Indexing (Low Effort)

Add post-checkout and post-merge git hooks that trigger a full re-index after every git pull or branch switch. Combine them with the file watcher, which covers live edits between pulls.

#!/bin/sh
# .githooks/post-merge
# Auto-reindex vector DB after git pull

if [ -d "scripts/vector-db" ] && command -v python3 >/dev/null 2>&1; then
    echo "Re-indexing workspace vector DB..."
    python3 scripts/vector-db/index.py . &
fi

#!/bin/sh
# .githooks/post-checkout
# Auto-reindex vector DB after branch switch

# Only run for branch checkouts (flag=1), not file checkouts (flag=0)
if [ "$3" = "1" ] && [ -d "scripts/vector-db" ] && command -v python3 >/dev/null 2>&1; then
    echo "Re-indexing workspace vector DB..."
    python3 scripts/vector-db/index.py . &
fi

Configure git to use the hooks directory:

git config core.hooksPath .githooks

Effort: 1 hour. Trade-off: Adds ~60 seconds (background) to every pull. Still local per architect.
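
If the 60-second full re-index on every pull becomes a nuisance, the hooks could instead re-index only the files that changed. The sketch below is an assumption, not part of the plan: it lists changed paths with git diff --name-only (ORIG_HEAD is set by git before a merge) and keeps the file types the watcher handles. The .md suffix is assumed for Markdown, and both function names are hypothetical.

```python
import subprocess
from pathlib import PurePosixPath

# ASSUMPTION: suffixes mirror the watcher's Markdown / YAML / Java branches.
INDEXABLE_SUFFIXES = {".md", ".yaml", ".yml", ".java"}


def changed_paths(old_ref="ORIG_HEAD", new_ref="HEAD", cwd="."):
    """Return repository-relative paths that differ between two commits."""
    result = subprocess.run(
        ["git", "diff", "--name-only", old_ref, new_ref],
        cwd=cwd, check=True, capture_output=True, text=True,
    )
    return [line for line in result.stdout.splitlines() if line]


def indexable(paths):
    """Keep only paths whose suffix the chunker knows how to split."""
    return [p for p in paths if PurePosixPath(p).suffix.lower() in INDEXABLE_SUFFIXES]
```

Each surviving path could then be handed to reindex-file.py. Deleted files also appear in the diff, so a real hook would need to remove their chunks rather than re-embed them.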

Approach C: Shared Qdrant Server (Medium Effort)

Replace embedded ChromaDB with a team-hosted Qdrant instance. All architects query the same index. A CI job re-indexes on every push to main.

# Addition to docker-compose.yml (or team-hosted VM)
  qdrant:
    image: qdrant/qdrant:v1.12
    container_name: novatrek-qdrant
    ports:
      - "6333:6333"   # REST API
      - "6334:6334"   # gRPC
    volumes:
      - qdrant-data:/qdrant/storage
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
      interval: 10s
      timeout: 5s
      retries: 5

CI job (GitHub Actions):

# .github/workflows/vector-index.yml
name: Reindex Vector DB
on:
  push:
    branches: [main]
    paths:
      - 'architecture/**'
      - 'decisions/**'
      - 'portal/docs/**'
      - 'config/**'
      - 'services/**'

jobs:
  reindex:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.12'
      - run: pip install -r scripts/vector-db/requirements.txt
      - run: python3 scripts/vector-db/index.py .
        env:
          VECTOR_BACKEND: qdrant
          QDRANT_URL: ${{ vars.QDRANT_URL }}
          KONG_AI_URL: ${{ vars.KONG_AI_URL }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The store.py module would need a backend switch:

import os

BACKEND = os.environ.get("VECTOR_BACKEND", "chromadb")  # chromadb | qdrant

if BACKEND == "qdrant":
    # Requires the qdrant-client package in addition to requirements.txt
    from qdrant_client import QdrantClient
    client = QdrantClient(url=os.environ.get("QDRANT_URL", "http://localhost:6333"))
else:
    import chromadb
    client = chromadb.PersistentClient(path=".vector-db")

Effort: 1-2 days. Trade-off: Requires network access and shared infrastructure. Single source of truth -- no per-architect index staleness.

Multi-Architect Recommendation

| Scenario | Approach | Why |
| --- | --- | --- |
| Solo architect (current state) | A + B | VS Code task auto-starts watcher; git hooks rebuild after pulls. Zero manual steps after initial setup |
| Team of 2-5 architects | A + B | Same as solo. Local DBs are acceptable when each architect's workspace diverges (feature branches) |
| Team of more than 5, or CI-driven workflows | C | Shared Qdrant eliminates the "every architect maintains their own DB" problem. CI-driven indexing guarantees freshness on main |

VS Code Extension Analysis

Could This Be a VS Code Extension?

Yes. VS Code's extension API provides every primitive needed:

| Requirement | VS Code Extension API |
| --- | --- |
| Watch file changes | vscode.workspace.onDidSaveTextDocument, vscode.workspace.onDidCreateFiles, vscode.workspace.onDidDeleteFiles for editor-initiated changes; vscode.workspace.createFileSystemWatcher for changes made on disk by git or external tools |
| Read workspace files | vscode.workspace.fs.readFile, vscode.workspace.findFiles |
| Background processing | Extension activation on workspace open (onStartupFinished) |
| Status bar feedback | vscode.window.createStatusBarItem -- show "Indexed 3,412 chunks" |
| Configuration | package.json contributes settings -- embedding provider, Kong URL, chunk size |
| MCP server hosting | Extension can spawn the MCP server as a child process, or expose tools directly |
| Shared state across windows | globalState for cross-window persistence |

A VS Code extension would eliminate every manual step:

  • No make vector-index -- the extension auto-indexes on activation
  • No make vector-watch -- file events are native to the extension lifecycle
  • No .roo/mcp.json manual config -- the extension registers the MCP server automatically
  • No separate terminal process -- everything runs inside the extension host

Should It Be a VS Code Extension?

For a solo architect or small team: No. For a distributable product: Yes.

Arguments FOR a VS Code Extension

| Advantage | Why it matters |
| --- | --- |
| Zero-touch setup | Install extension, open workspace, done. No pip install, no Docker, no Makefile targets, no background terminals |
| Native file watching | File events arrive inside the extension host, so no separate watchdog process is needed. A vscode.workspace.createFileSystemWatcher also reports changes made by git operations, refactors, and external tools |
| UX integration | Status bar showing index health, progress notifications during re-index, command palette commands (>Workspace Search: Reindex, >Workspace Search: Query) |
| Per-workspace activation | Extension activates only for workspaces that need it (via activationEvents). No wasted resources |
| Portable | Any architect installs the extension from the marketplace (or a .vsix file). No Python environment, no requirements.txt compatibility issues |
| Lifecycle management | Extension deactivates cleanly when VS Code closes -- no orphaned watcher processes |

Arguments AGAINST a VS Code Extension

| Disadvantage | Why it matters more |
| --- | --- |
| Development effort is 3-5x higher | A VS Code extension requires TypeScript, webpack bundling, extension manifest, activation events, contribution points, state management. The Python scripts in this plan are ~400 lines total. An equivalent extension is ~1,500-2,500 lines of TypeScript + build config |
| Dependency bundling is painful | ChromaDB is a Python library. A VS Code extension runs in Node.js. Options: (a) bundle a ChromaDB Python subprocess, (b) use a JavaScript vector DB like vectra or hnswlib-node, (c) HTTP calls to a ChromaDB server in Docker. None are as clean as pip install chromadb |
| Embedding model integration | The extension would need to either bundle an embedding model (huge), call an external API (requires API key config in VS Code settings), or shell out to Python/Ollama. The Python script approach handles this natively |
| Testing and debugging | Extension debugging requires launching a separate VS Code Extension Development Host. Python scripts can be tested with pytest in 2 seconds |
| Maintenance burden | VS Code API changes between versions. Extension marketplace publishing has review requirements. Python scripts just work |
| Already solved by Continue.dev | Continue.dev already IS this VS Code extension -- open source, local codebase indexing, multiple LLM backends. Building a custom extension duplicates their work |

What a Custom Extension Gives You That Continue.dev Doesn't

| Requirement | Continue.dev | Custom Extension |
| --- | --- | --- |
| Workspace-wide semantic search | Yes (@codebase) | Yes |
| Format-aware chunking (OpenAPI, YAML) | Partial -- generic chunking | Full control |
| Kong AI Gateway routing | No | Yes |
| Cost tracking per query | No | Yes (via Kong) |
| MCP tool exposure for Roo Code | No -- Continue.dev is its own chat | Yes |
| NovaTrek-specific metadata enrichment | No | Yes |

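The main unique value a custom extension provides over Continue.dev is Kong AI integration (routing and cost tracking) and MCP tool exposure for Roo Code; format-aware chunking and NovaTrek-specific metadata are secondary gains.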

If extension-level UX is desired, build a thin VS Code extension wrapper around the existing Python scripts rather than rewriting everything in TypeScript:

VS Code Extension (TypeScript, ~200 lines)
  |-- onStartupFinished -> spawn `python3 scripts/vector-db/index.py`
  |-- onDidSaveTextDocument -> spawn `python3 scripts/vector-db/reindex-file.py <path>`
  |-- Status bar item -> reads `.vector-db/stats.json`
  |-- Command: "Reindex Workspace" -> spawns full `index.py`
  +-- Extension settings -> Kong URL, embedding provider, top-k

Python scripts (unchanged from this plan)
  |-- chunker.py, store.py, embedder.py -> actual work
  |-- mcp_server.py -> Roo Code integration
  +-- index.py, watcher.py -> invoked by extension

This gives:

  • Extension UX (auto-start, status bar, command palette)
  • Python implementation (ChromaDB native, easy to test, ~400 lines)
  • No webpack/bundling complexity for the heavy logic
  • Extension is a thin shell -- trivial to maintain
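
The wrapper's status bar reads .vector-db/stats.json, but the plan does not show which script writes that file. A minimal sketch follows, assuming index.py and the single-file re-indexer call a helper like this after each run. The name write_stats and the updated_at field are hypothetical; total_chunks matches the key returned by get_stats(). Writing to a temporary file and renaming it means the extension never reads a half-written file.

```python
import json
import os
import tempfile
import time


def write_stats(total_chunks, stats_path=".vector-db/stats.json"):
    """Atomically write index statistics for the extension's status bar."""
    directory = os.path.dirname(stats_path) or "."
    os.makedirs(directory, exist_ok=True)
    payload = {"total_chunks": total_chunks, "updated_at": time.time()}
    # Write to a temp file in the same directory, then rename over the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle)
        os.replace(tmp_path, stats_path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return payload
```

On the TypeScript side, the status bar item would then only need to read and parse this one small file.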

Extension Decision Matrix

| If you are... | Do this | Why |
| --- | --- | --- |
| Solo architect wanting RAG now | Use the Python scripts from this plan | Working in days, not weeks. make vector-index && make vector-watch is 2 commands |
| Solo architect who wants polish | Install Continue.dev | Zero development. @codebase works out of the box |
| Building for a team of 3-5 | Python scripts + thin extension wrapper | Auto-start eliminates "forgot to run the watcher" failure mode. Kong routing gives cost visibility |
| Building a product for distribution | Full VS Code extension | Only if packaging for dozens of users who cannot be expected to run Python scripts |

Updated Implementation Phase (Phase F)

If the thin extension wrapper is pursued, add after Phase E:

Phase F: VS Code Extension Wrapper (Day 6-7, optional)

| Step | Task | Validation |
| --- | --- | --- |
| F.1 | Scaffold VS Code extension with yo code generator | Extension loads in Extension Development Host |
| F.2 | Add onStartupFinished activation that spawns python3 scripts/vector-db/watcher.py as a child process | Opening workspace starts watcher automatically |
| F.3 | Add onDidSaveTextDocument handler that calls python3 scripts/vector-db/reindex-file.py <path> | Saving a file triggers re-index within 2 seconds |
| F.4 | Add status bar item that shows chunk count from .vector-db/stats.json | Status bar displays "Vector DB: 3,412 chunks" |
| F.5 | Add command palette: "Workspace Search: Full Reindex" | Command triggers index.py with progress notification |
| F.6 | Add extension settings for Kong URL and embedding provider | Settings appear under "Workspace Search" in VS Code settings |
| F.7 | Package as .vsix for team distribution | vsce package produces installable file |

Milestone: Zero-touch vector DB lifecycle -- opens with workspace, updates on save, no manual commands needed.


Success Criteria

| Criterion | Measurement |
| --- | --- |
| Full workspace indexed | vector-stats reports > 2,500 chunks |
| Query relevance | Top-3 results contain the answer for 80%+ of test queries |
| Query latency | < 500 ms end-to-end (embed query + search + format results) |
| Incremental re-index | Changed file re-indexed within 5 seconds of save |
| Kong observability | Every embedding and LLM call logged with token count and cost |
| Agent adoption | Roo Code calls workspace_search in > 50% of multi-file investigation tasks |
| Zero manual context | Architect does not need to manually paste file contents or explain workspace structure |
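
The query-relevance criterion could be scored with a small evaluation helper such as the one below. It is a sketch under one assumption: a query counts as answered when at least one of its top-k result paths is in a hand-labelled set of relevant files. The function name and the sample data are hypothetical.

```python
def hit_rate_at_k(results, expected, k=3):
    """Fraction of test queries whose top-k results include a relevant file.

    results:  query -> ranked list of file paths returned by the search
    expected: query -> set of file paths that contain the answer
    """
    if not expected:
        raise ValueError("at least one labelled query is required")
    hits = 0
    for query, relevant in expected.items():
        top_k = results.get(query, [])[:k]
        if any(path in relevant for path in top_k):
            hits += 1
    return hits / len(expected)


# Hypothetical labelled queries and search output, for illustration only.
expected = {"q1": {"a.md"}, "q2": {"b.yaml"}}
results = {"q1": ["x.md", "a.md", "y.md"], "q2": ["x.md", "y.md", "z.md", "b.yaml"]}
rate = hit_rate_at_k(results, expected)  # q1 hits, q2 misses (b.yaml is ranked 4th)
```

A rate of 0.8 or higher over the full test set would satisfy the 80% target in the table above.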

References