Implementation Plan: Workspace-Wide Vector Database with Kong AI Gateway¶
| Author | Christopher Blaisdell |
| Date | 2026-03-14 |
| Status | PROPOSED |
| Relates To | Vector DB / RAG Feasibility Analysis |
| Phase | Phase 2 - AI Workflow Enhancement |
| See Also | Context Window Utilization Analysis, Deep Research: Copilot vs Kong+Roo Economics |
Executive Summary¶
This plan describes how to build a workspace-wide vector database that indexes the entire NovaTrek architecture workspace and exposes it to AI coding assistants (Roo Code, Copilot, Continue.dev) via MCP protocol, with Kong AI Gateway managing all LLM and embedding API traffic for cost control, observability, prompt injection of retrieved context, and multi-provider failover.
The key insight: Kong AI Gateway cannot be the vector database or retrieval engine, but it is the correct place to manage the AI API traffic that powers the system. The architecture places each component in its proper layer:
| Layer | Component | Responsibility |
|---|---|---|
| Storage | ChromaDB (local) | Vector storage, similarity search, metadata filtering |
| Indexing | Python chunking pipeline | Format-aware document splitting, embedding generation |
| Retrieval | MCP Server (Python) | Query interface for AI agents via MCP protocol |
| AI Gateway | Kong AI Gateway | Route embedding + LLM calls, cost tracking, prompt decoration, guardrails |
| Inference | OpenAI / Anthropic / Ollama | Embedding models + LLM reasoning |
| Client | Roo Code / VS Code | AI assistant consuming the MCP tool |
Architecture¶
System Context (C4 Level 1)¶
+------------------+
| Architect |
| (VS Code User) |
+--------+---------+
|
+--------v---------+
| Roo Code / |
| Continue.dev / |
| Copilot |
+--------+---------+
| MCP Protocol
+--------v---------+
| MCP Vector Server|
| (Python, local) |
+---+----+----+----+
| | |
+-----------+ | +------------+
| | |
+--------v-------+ +----v--------+ +-----v----------+
| ChromaDB | | Kong AI GW | | File Watcher |
| (local vector | | (API proxy) | | (fswatch / |
| database) | +----+--------+ | watchdog) |
+----------------+ | +----------------+
+---+---+
| |
+--------v+ +--v----------+
| Embedding| | LLM Inference|
| Provider | | Provider |
| (OpenAI/ | | (Anthropic/ |
| Ollama) | | OpenAI) |
+----------+ +-------------+
Data Flow¶
Indexing Flow (background, on file change):
File saved in workspace
-> File watcher detects change
-> Chunking pipeline splits file (format-aware)
-> Chunks sent to Kong AI Gateway /embeddings endpoint
-> Kong routes to embedding provider (OpenAI or Ollama)
-> Kong logs: model, tokens, cost, latency
-> Embedding vectors returned
-> ChromaDB upserts vectors with metadata (file path, line range, content type)
Query Flow (on-demand, during AI agent reasoning):
AI agent calls MCP tool: search("which services call svc-guest-profiles?")
-> MCP server embeds query via Kong AI Gateway /embeddings
-> MCP server queries ChromaDB for top-k similar chunks
-> MCP server returns ranked results with file paths + content snippets
-> AI agent uses retrieved context in its reasoning
-> AI agent's LLM call routes through Kong AI Gateway /chat/completions
-> Kong logs: full request cost, latency, model, token counts
Kong AI Gateway's Role (Specifically)¶
Kong AI does not perform retrieval. It provides five critical infrastructure services for this pipeline:
| Service | How Kong AI Delivers It |
|---|---|
| Unified embedding API | Kong's AI Proxy plugin exposes a single /embeddings endpoint. Backend can be switched between OpenAI, Cohere, or local Ollama without changing any client code |
| Cost tracking | Every embedding call and every LLM inference call passes through Kong. The AI Observability plugin logs token counts, model name, latency, and estimated cost per request. This gives exact cost-per-index-run and cost-per-query metrics |
| Prompt decoration | Kong's AI Prompt Decorator plugin can inject a system prompt prefix into every LLM call (e.g., "You have access to a workspace vector search tool. Use it before reading files manually."). This steers agent behavior without modifying the AI assistant's configuration |
| Rate limiting | Kong's Rate Limiting plugin prevents runaway re-indexing from consuming excessive embedding API quota (e.g., max 1,000 embedding requests per minute) |
| Multi-provider failover | If OpenAI's embedding endpoint is down, Kong automatically routes to the fallback provider (Cohere or local Ollama) with no client-side changes |
Component Design¶
Component 1: Chunking Pipeline¶
Purpose: Split workspace files into semantically meaningful chunks suitable for embedding.
Location: scripts/vector-db/chunker.py
Format-aware chunking rules:
| File Type | Chunking Strategy | Expected Chunk Size |
|---|---|---|
Markdown (.md) | Split by H2 (##) headers. Each section becomes one chunk. Front matter (title, metadata table) stays attached to the first chunk | 200-1500 tokens |
| YAML — OpenAPI specs | Split by path + operation. Each paths./endpoint.method block becomes one chunk. info and components/schemas are separate chunks | 100-800 tokens |
| YAML — metadata files | Split by top-level key. Each capability, ticket, or event definition becomes one chunk | 50-400 tokens |
AsyncAPI (.yaml) | Split by channel. Each channel + message schema becomes one chunk | 100-500 tokens |
Java (.java) | Split by class method. Each method (with its Javadoc) becomes one chunk. Class-level annotations and imports stay with the first chunk | 100-1000 tokens |
PlantUML (.puml) | Entire file as one chunk (these are small) | 50-200 tokens |
ADR (.md) | Split by MADR section (Context, Decision Drivers, Options, Outcome, Consequences) | 100-500 tokens |
Metadata per chunk:
{
"file_path": "architecture/specs/svc-check-in.yaml",
"file_type": "openapi",
"chunk_type": "endpoint", # endpoint | schema | section | method | definition
"section_heading": "POST /check-ins",
"line_start": 45,
"line_end": 98,
"service": "svc-check-in", # extracted from path or spec content
"domain": "Operations", # from DOMAINS mapping
"last_modified": "2026-03-12T14:30:00Z"
}
Implementation:
# scripts/vector-db/chunker.py
import os
import yaml
import re
from dataclasses import dataclass
from typing import Generator
from pathlib import Path
@dataclass
class Chunk:
content: str
file_path: str
file_type: str
chunk_type: str
section_heading: str
line_start: int
line_end: int
metadata: dict
SKIP_DIRS = {'.git', 'node_modules', '.venv', 'site', '__pycache__', '.mypy_cache'}
SKIP_FILES = {'.DS_Store', '.gitignore', '.env'}
def chunk_markdown(file_path: str, content: str) -> Generator[Chunk, None, None]:
"""Split Markdown by H2 headers."""
lines = content.split('\n')
current_section = []
current_heading = "Preamble"
section_start = 1
for i, line in enumerate(lines, 1):
if line.startswith('## ') and current_section:
yield Chunk(
content='\n'.join(current_section),
file_path=file_path,
file_type='markdown',
chunk_type='section',
section_heading=current_heading,
line_start=section_start,
line_end=i - 1,
metadata={}
)
current_section = [line]
current_heading = line.lstrip('# ').strip()
section_start = i
else:
current_section.append(line)
if current_section:
yield Chunk(
content='\n'.join(current_section),
file_path=file_path,
file_type='markdown',
chunk_type='section',
section_heading=current_heading,
line_start=section_start,
line_end=len(lines),
metadata={}
)
def chunk_openapi(file_path: str, content: str) -> Generator[Chunk, None, None]:
"""Split OpenAPI YAML by path+operation."""
try:
spec = yaml.safe_load(content)
except yaml.YAMLError:
yield Chunk(
content=content,
file_path=file_path,
file_type='openapi',
chunk_type='full_file',
section_heading=os.path.basename(file_path),
line_start=1,
line_end=content.count('\n') + 1,
metadata={}
)
return
# Info block
if 'info' in spec:
info_yaml = yaml.dump({'info': spec['info']}, default_flow_style=False)
yield Chunk(
content=info_yaml,
file_path=file_path,
file_type='openapi',
chunk_type='info',
section_heading=spec.get('info', {}).get('title', 'API Info'),
line_start=1,
line_end=1,
metadata={'service': _extract_service(file_path)}
)
# Each path+operation
for path, methods in (spec.get('paths') or {}).items():
for method, operation in methods.items():
if method.startswith('x-'):
continue
op_yaml = yaml.dump(
{path: {method: operation}},
default_flow_style=False
)
summary = operation.get('summary', f'{method.upper()} {path}')
yield Chunk(
content=op_yaml,
file_path=file_path,
file_type='openapi',
chunk_type='endpoint',
section_heading=f'{method.upper()} {path} -- {summary}',
line_start=1,
line_end=1,
metadata={'service': _extract_service(file_path)}
)
# Schemas
schemas = (spec.get('components') or {}).get('schemas') or {}
for name, schema in schemas.items():
schema_yaml = yaml.dump({name: schema}, default_flow_style=False)
yield Chunk(
content=schema_yaml,
file_path=file_path,
file_type='openapi',
chunk_type='schema',
section_heading=f'Schema: {name}',
line_start=1,
line_end=1,
metadata={'service': _extract_service(file_path)}
)
def chunk_java(file_path: str, content: str) -> Generator[Chunk, None, None]:
"""Split Java by method boundaries."""
# Simplified: split on method-level patterns
lines = content.split('\n')
method_pattern = re.compile(
r'^\s+(public|private|protected)\s+\S+\s+\w+\s*\('
)
current_block = []
block_start = 1
current_heading = os.path.basename(file_path)
for i, line in enumerate(lines, 1):
if method_pattern.match(line) and current_block:
yield Chunk(
content='\n'.join(current_block),
file_path=file_path,
file_type='java',
chunk_type='method',
section_heading=current_heading,
line_start=block_start,
line_end=i - 1,
metadata={}
)
current_block = [line]
current_heading = line.strip()
block_start = i
else:
current_block.append(line)
if current_block:
yield Chunk(
content='\n'.join(current_block),
file_path=file_path,
file_type='java',
chunk_type='method',
section_heading=current_heading,
line_start=block_start,
line_end=len(lines),
metadata={}
)
def _extract_service(file_path: str) -> str:
"""Extract service name from file path."""
parts = Path(file_path).parts
for part in parts:
if part.startswith('svc-'):
return part
stem = Path(file_path).stem
if stem.startswith('svc-'):
return stem
return ''
def chunk_workspace(workspace_root: str) -> Generator[Chunk, None, None]:
"""Walk workspace and yield all chunks."""
for dirpath, dirnames, filenames in os.walk(workspace_root):
dirnames[:] = [d for d in dirnames if d not in SKIP_DIRS]
for filename in filenames:
if filename in SKIP_FILES:
continue
file_path = os.path.join(dirpath, filename)
rel_path = os.path.relpath(file_path, workspace_root)
try:
with open(file_path, 'r', encoding='utf-8', errors='ignore') as f:
content = f.read()
except (PermissionError, IsADirectoryError):
continue
if not content.strip():
continue
ext = Path(filename).suffix.lower()
if ext == '.md':
yield from chunk_markdown(rel_path, content)
elif ext in ('.yaml', '.yml'):
# Detect OpenAPI vs plain YAML
if 'openapi:' in content[:500]:
yield from chunk_openapi(rel_path, content)
else:
# Plain YAML: single chunk
yield Chunk(
content=content,
file_path=rel_path,
file_type='yaml',
chunk_type='full_file',
section_heading=filename,
line_start=1,
line_end=content.count('\n') + 1,
metadata={}
)
elif ext == '.java':
yield from chunk_java(rel_path, content)
elif ext in ('.puml', '.plantuml'):
yield Chunk(
content=content,
file_path=rel_path,
file_type='plantuml',
chunk_type='full_file',
section_heading=filename,
line_start=1,
line_end=content.count('\n') + 1,
metadata={}
)
Component 2: ChromaDB Vector Store¶
Purpose: Store embeddings locally with metadata filtering and similarity search.
Location: scripts/vector-db/store.py
Why ChromaDB:
| Criterion | ChromaDB | LanceDB | Qdrant | FAISS |
|---|---|---|---|---|
| Local-first (no server needed) | Yes (persistent mode) | Yes | Needs Docker | Yes |
| Metadata filtering | Yes | Yes | Yes | No |
| Python SDK quality | Excellent | Good | Good | Minimal |
| Incremental upsert | Yes (by ID) | Yes | Yes | Manual |
| Hybrid search (vector + keyword) | Yes (with where_document) | No | Yes | No |
| Disk footprint | <100 MB for this workspace | <50 MB | ~200 MB (Docker) | <50 MB |
| Setup complexity | pip install chromadb | pip install lancedb | Docker container | pip install faiss-cpu |
Implementation:
# scripts/vector-db/store.py
import chromadb
import hashlib
from pathlib import Path
DB_PATH = ".vector-db"
COLLECTION_NAME = "novatrek-workspace"
def get_collection():
client = chromadb.PersistentClient(path=DB_PATH)
return client.get_or_create_collection(
name=COLLECTION_NAME,
metadata={"hnsw:space": "cosine"}
)
def chunk_id(file_path: str, line_start: int, section_heading: str) -> str:
"""Deterministic ID for upsert idempotency."""
raw = f"{file_path}:{line_start}:{section_heading}"
return hashlib.sha256(raw.encode()).hexdigest()[:16]
def upsert_chunks(chunks, embeddings):
"""Upsert chunk embeddings into ChromaDB."""
collection = get_collection()
ids = []
documents = []
metadatas = []
for chunk in chunks:
ids.append(chunk_id(chunk.file_path, chunk.line_start, chunk.section_heading))
documents.append(chunk.content)
metadatas.append({
"file_path": chunk.file_path,
"file_type": chunk.file_type,
"chunk_type": chunk.chunk_type,
"section_heading": chunk.section_heading,
"line_start": chunk.line_start,
"line_end": chunk.line_end,
**chunk.metadata
})
# ChromaDB handles batching internally
collection.upsert(
ids=ids,
embeddings=embeddings,
documents=documents,
metadatas=metadatas
)
def query(embedding, top_k=5, where_filter=None):
"""Query ChromaDB for similar chunks."""
collection = get_collection()
kwargs = {
"query_embeddings": [embedding],
"n_results": top_k,
"include": ["documents", "metadatas", "distances"]
}
if where_filter:
kwargs["where"] = where_filter
return collection.query(**kwargs)
def delete_by_file(file_path: str):
"""Remove all chunks for a file (before re-indexing)."""
collection = get_collection()
collection.delete(where={"file_path": file_path})
def get_stats():
"""Return collection statistics."""
collection = get_collection()
return {
"total_chunks": collection.count(),
"collection": COLLECTION_NAME
}
Component 3: Kong AI Gateway (Docker)¶
Purpose: Central proxy for all embedding and LLM API calls. Provides cost tracking, rate limiting, prompt decoration, and multi-provider failover.
Location: Added to docker-compose.yml
Docker Compose addition:
# ---------------------------------------------------------------------------
# Kong AI Gateway (manages embedding + LLM API traffic)
# ---------------------------------------------------------------------------
kong:
image: kong/kong-gateway:3.9
container_name: novatrek-kong-ai
environment:
KONG_DATABASE: "off"
KONG_DECLARATIVE_CONFIG: /etc/kong/kong.yml
KONG_PROXY_LISTEN: "0.0.0.0:8000"
KONG_ADMIN_LISTEN: "0.0.0.0:8001"
KONG_LOG_LEVEL: info
ports:
- "8000:8000" # Proxy (AI API calls go here)
- "8001:8001" # Admin API
volumes:
- ./config/kong/kong.yml:/etc/kong/kong.yml:ro
healthcheck:
test: ["CMD", "kong", "health"]
interval: 10s
timeout: 5s
retries: 5
Kong declarative config (config/kong/kong.yml):
_format_version: "3.0"
services:
# ===== Embedding Provider: OpenAI =====
- name: openai-embeddings
url: https://api.openai.com/v1
routes:
- name: embeddings-route
paths:
- /ai/embeddings
strip_path: true
plugins:
- name: ai-proxy
config:
route_type: llm/v1/embeddings
auth:
header_name: Authorization
header_value: "Bearer ${OPENAI_API_KEY}"
model:
provider: openai
name: text-embedding-3-small
- name: rate-limiting
config:
minute: 500
policy: local
- name: ai-prompt-decorator
config:
prepend:
- role: system
content: >
You are indexing an architecture workspace for the NovaTrek
Adventures platform. Embeddings are used for semantic search
over OpenAPI specs, ADRs, solution designs, and service metadata.
# ===== Embedding Provider: Ollama (local fallback) =====
- name: ollama-embeddings
url: http://host.docker.internal:11434/v1
routes:
- name: embeddings-local-route
paths:
- /ai/embeddings/local
strip_path: true
plugins:
- name: rate-limiting
config:
minute: 1000
policy: local
# ===== LLM Inference: Anthropic (for Roo Code RAG-augmented calls) =====
- name: anthropic-chat
url: https://api.anthropic.com/v1
routes:
- name: chat-route
paths:
- /ai/chat
strip_path: true
plugins:
- name: ai-proxy
config:
route_type: llm/v1/chat
auth:
header_name: x-api-key
header_value: "${ANTHROPIC_API_KEY}"
model:
provider: anthropic
name: claude-sonnet-4-20250514
- name: rate-limiting
config:
minute: 100
policy: local
- name: ai-prompt-decorator
config:
prepend:
- role: system
content: >
You have access to a workspace vector search tool via MCP.
When investigating architecture questions, query the vector
database before reading files directly. The workspace contains
19 microservice OpenAPI specs, 11 ADRs, event schemas, and
solution designs for the NovaTrek Adventures platform.
# ===== LLM Inference: OpenAI (fallback) =====
- name: openai-chat
url: https://api.openai.com/v1
routes:
- name: chat-fallback-route
paths:
- /ai/chat/openai
strip_path: true
plugins:
- name: ai-proxy
config:
route_type: llm/v1/chat
auth:
header_name: Authorization
header_value: "Bearer ${OPENAI_API_KEY}"
model:
provider: openai
name: gpt-4.1
- name: rate-limiting
config:
minute: 100
policy: local
What Kong AI tracks for every request:
{
"request.model": "text-embedding-3-small",
"request.provider": "openai",
"response.tokens.input": 1523,
"response.tokens.output": 0,
"response.latency_ms": 142,
"response.cost_usd": 0.0000305,
"consumer": "vector-indexer",
"route": "embeddings-route",
"timestamp": "2026-03-14T15:30:42Z"
}
Component 4: Embedding Client (via Kong AI)¶
Purpose: Generate embeddings by calling Kong AI Gateway's unified /ai/embeddings endpoint.
Location: scripts/vector-db/embedder.py
# scripts/vector-db/embedder.py
import os
import requests
from typing import Optional
KONG_BASE_URL = os.environ.get("KONG_AI_URL", "http://localhost:8000")
def embed_texts(texts: list[str], provider: str = "openai") -> list[list[float]]:
"""Generate embeddings via Kong AI Gateway."""
if provider == "local":
url = f"{KONG_BASE_URL}/ai/embeddings/local"
payload = {
"model": "nomic-embed-text",
"input": texts
}
else:
url = f"{KONG_BASE_URL}/ai/embeddings"
payload = {
"model": "text-embedding-3-small",
"input": texts
}
response = requests.post(url, json=payload, timeout=30)
response.raise_for_status()
data = response.json()
return [item["embedding"] for item in data["data"]]
def embed_query(query: str, provider: str = "openai") -> list[float]:
"""Embed a single query string."""
return embed_texts([query], provider=provider)[0]
Component 5: MCP Server¶
Purpose: Expose vector search as an MCP tool that Roo Code (and other MCP-compatible clients) can call autonomously.
Location: scripts/vector-db/mcp_server.py
MCP Tool Definition:
{
"name": "workspace_search",
"description": "Semantic search across the entire NovaTrek architecture workspace. Searches OpenAPI specs, ADRs, solution designs, event schemas, capability metadata, and Java source code. Returns the top-k most relevant chunks with file paths and line numbers. Use this BEFORE reading files to find relevant context efficiently.",
"inputSchema": {
"type": "object",
"properties": {
"query": {
"type": "string",
},
"top_k": {
"type": "integer",
"default": 5,
"description": "Number of results to return (1-20)"
},
"file_type": {
"type": "string",
"enum": ["markdown", "openapi", "yaml", "java", "plantuml"],
"description": "Optional filter to restrict search to a specific file type"
},
"service": {
"type": "string",
"description": "Optional filter to restrict search to a specific service (e.g., 'svc-check-in')"
}
},
"required": ["query"]
}
}
Implementation:
# scripts/vector-db/mcp_server.py
import asyncio
import json
from mcp.server import Server
from mcp.server.stdio import stdio_server
from mcp.types import Tool, TextContent
from store import query as vector_query, get_stats
from embedder import embed_query
app = Server("novatrek-workspace-search")
@app.list_tools()
async def list_tools():
return [
Tool(
name="workspace_search",
description=(
"Semantic search across the entire NovaTrek architecture workspace. "
"Searches OpenAPI specs, ADRs, solution designs, event schemas, "
"capability metadata, and Java source code. Returns the top-k most "
"relevant chunks with file paths and line numbers. "
"Use this BEFORE reading files to find relevant context efficiently."
),
inputSchema={
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Natural language search query"
},
"top_k": {
"type": "integer",
"default": 5,
"description": "Number of results to return (1-20)"
},
"file_type": {
"type": "string",
"enum": ["markdown", "openapi", "yaml", "java", "plantuml"],
"description": "Optional: restrict to file type"
},
"service": {
"type": "string",
"description": "Optional: restrict to service (e.g., svc-check-in)"
}
},
"required": ["query"]
}
),
Tool(
name="workspace_index_stats",
description="Get statistics about the workspace vector index",
inputSchema={
"type": "object",
"properties": {}
}
)
]
@app.call_tool()
async def call_tool(name: str, arguments: dict):
if name == "workspace_search":
query_text = arguments["query"]
top_k = min(arguments.get("top_k", 5), 20)
# Build metadata filter
where_filter = {}
if "file_type" in arguments:
where_filter["file_type"] = arguments["file_type"]
if "service" in arguments:
where_filter["service"] = arguments["service"]
# Embed query via Kong AI Gateway
query_embedding = embed_query(query_text)
# Search ChromaDB
results = vector_query(
embedding=query_embedding,
top_k=top_k,
where_filter=where_filter if where_filter else None
)
# Format results
output_lines = [f"## Search Results for: \"{query_text}\"\n"]
for i, (doc, meta, dist) in enumerate(zip(
results["documents"][0],
results["metadatas"][0],
results["distances"][0]
)):
score = 1 - dist # cosine distance to similarity
output_lines.append(
f"### Result {i+1} (similarity: {score:.3f})\n"
f"**File:** {meta['file_path']} "
f"(lines {meta['line_start']}-{meta['line_end']})\n"
f"**Type:** {meta['file_type']} / {meta['chunk_type']}\n"
f"**Section:** {meta['section_heading']}\n\n"
f"```\n{doc[:500]}{'...' if len(doc) > 500 else ''}\n```\n"
)
return [TextContent(type="text", text='\n'.join(output_lines))]
elif name == "workspace_index_stats":
stats = get_stats()
return [TextContent(
type="text",
text=json.dumps(stats, indent=2)
)]
return [TextContent(type="text", text=f"Unknown tool: {name}")]
async def main():
async with stdio_server() as (read_stream, write_stream):
await app.run(read_stream, write_stream)
if __name__ == "__main__":
asyncio.run(main())
Component 6: File Watcher (Incremental Re-indexing)¶
Purpose: Detect file changes and re-index only the modified files.
Location: scripts/vector-db/watcher.py
# scripts/vector-db/watcher.py
import time
from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler
from pathlib import Path
from chunker import chunk_markdown, chunk_openapi, chunk_java, SKIP_DIRS
from embedder import embed_texts
from store import upsert_chunks, delete_by_file
WATCH_EXTENSIONS = {'.md', '.yaml', '.yml', '.java', '.puml'}
class WorkspaceHandler(FileSystemEventHandler):
def __init__(self, workspace_root: str):
self.workspace_root = workspace_root
def on_modified(self, event):
if event.is_directory:
return
self._reindex(event.src_path)
def on_created(self, event):
if event.is_directory:
return
self._reindex(event.src_path)
def on_deleted(self, event):
if event.is_directory:
return
rel_path = str(Path(event.src_path).relative_to(self.workspace_root))
delete_by_file(rel_path)
def _reindex(self, abs_path: str):
path = Path(abs_path)
if path.suffix not in WATCH_EXTENSIONS:
return
if any(skip in path.parts for skip in SKIP_DIRS):
return
rel_path = str(path.relative_to(self.workspace_root))
try:
content = path.read_text(encoding='utf-8', errors='ignore')
except (PermissionError, FileNotFoundError):
return
if not content.strip():
return
# Delete old chunks for this file
delete_by_file(rel_path)
# Re-chunk
if path.suffix == '.md':
chunks = list(chunk_markdown(rel_path, content))
elif path.suffix in ('.yaml', '.yml') and 'openapi:' in content[:500]:
chunks = list(chunk_openapi(rel_path, content))
elif path.suffix == '.java':
chunks = list(chunk_java(rel_path, content))
else:
return
if not chunks:
return
# Embed via Kong AI
texts = [c.content for c in chunks]
embeddings = embed_texts(texts)
# Upsert
upsert_chunks(chunks, embeddings)
print(f"Re-indexed {rel_path}: {len(chunks)} chunks")
def watch(workspace_root: str):
handler = WorkspaceHandler(workspace_root)
observer = Observer()
observer.schedule(handler, workspace_root, recursive=True)
observer.start()
print(f"Watching {workspace_root} for changes...")
try:
while True:
time.sleep(1)
except KeyboardInterrupt:
observer.stop()
observer.join()
Component 7: Index Runner (Full Re-index)¶
Purpose: One-shot full workspace indexing.
Location: scripts/vector-db/index.py
# scripts/vector-db/index.py
import sys
import time
from chunker import chunk_workspace
from embedder import embed_texts
from store import upsert_chunks, get_stats
BATCH_SIZE = 50 # Chunks per embedding API call
def index_workspace(workspace_root: str):
print(f"Indexing workspace: {workspace_root}")
start = time.time()
all_chunks = list(chunk_workspace(workspace_root))
print(f"Chunked {len(all_chunks)} chunks from workspace")
# Batch embed
for i in range(0, len(all_chunks), BATCH_SIZE):
batch = all_chunks[i:i + BATCH_SIZE]
texts = [c.content for c in batch]
embeddings = embed_texts(texts)
upsert_chunks(batch, embeddings)
print(f" Indexed batch {i//BATCH_SIZE + 1}/{(len(all_chunks) + BATCH_SIZE - 1)//BATCH_SIZE}")
elapsed = time.time() - start
stats = get_stats()
print(f"Done. {stats['total_chunks']} chunks indexed in {elapsed:.1f}s")
if __name__ == "__main__":
root = sys.argv[1] if len(sys.argv) > 1 else "."
index_workspace(root)
Roo Code MCP Configuration¶
Add to Roo Code's MCP settings (.roo/mcp.json or via Roo Code settings UI):
{
"mcpServers": {
"novatrek-workspace": {
"command": "python3",
"args": ["scripts/vector-db/mcp_server.py"],
"env": {
"KONG_AI_URL": "http://localhost:8000"
}
}
}
}
Once configured, Roo Code will see workspace_search and workspace_index_stats as available tools and can call them autonomously during any task.
Directory Structure¶
scripts/vector-db/
├── README.md # Setup and usage instructions
├── requirements.txt # Python dependencies
├── chunker.py # Format-aware document chunking
├── store.py # ChromaDB vector storage (supports Qdrant backend)
├── embedder.py # Embedding client (via Kong AI Gateway)
├── mcp_server.py # MCP server for Roo Code integration
├── watcher.py # File watcher for incremental re-indexing
├── reindex-file.py # Single-file re-indexer (called by VS Code extension)
├── index.py # Full workspace indexer
└── test_chunker.py # Unit tests for chunking logic
config/kong/
└── kong.yml # Kong AI Gateway declarative configuration
.vscode/
└── tasks.json # Auto-start watcher on workspace open
.githooks/
├── post-merge # Auto-reindex after git pull
└── post-checkout # Auto-reindex after branch switch
.vector-db/ # ChromaDB persistent storage (gitignored)
Dependencies¶
Python (scripts/vector-db/requirements.txt):
Docker:
- Kong Gateway 3.9+ (from
kong/kong-gateway:3.9) - Ollama (optional, for local embeddings):
docker run -d -p 11434:11434 ollama/ollama
API Keys (in .env):
export OPENAI_API_KEY=sk-... # For text-embedding-3-small
export ANTHROPIC_API_KEY=sk-ant-... # For Claude (if routing LLM calls through Kong)
Implementation Phases¶
Phase A: Foundation (Day 1-2)¶
| Step | Task | Validation |
|---|---|---|
| A.1 | Create scripts/vector-db/ directory and requirements.txt | pip install -r requirements.txt succeeds |
| A.2 | Implement chunker.py with Markdown + YAML + Java splitters | Unit test: chunk a sample OpenAPI spec, verify endpoint-level splitting |
| A.3 | Implement store.py with ChromaDB persistent storage | Unit test: upsert 10 chunks, query by embedding, verify top-k results |
| A.4 | Implement embedder.py with direct OpenAI calls (no Kong yet) | Verify: embed a test string, get 1536-dim vector back |
| A.5 | Implement index.py full workspace indexer | Run against workspace, verify chunk count and ChromaDB stats |
| A.6 | Add .vector-db/ to .gitignore | Verify directory not tracked |
Milestone: Full workspace indexed into local ChromaDB. Can query from Python REPL.
Phase B: MCP Server (Day 2-3)¶
| Step | Task | Validation |
|---|---|---|
| B.1 | Implement mcp_server.py with workspace_search and workspace_index_stats tools | MCP inspector tool: connect and list tools |
| B.2 | Configure Roo Code MCP connection (.roo/mcp.json) | Roo Code shows "novatrek-workspace" in MCP server list |
| B.3 | Test end-to-end: ask Roo Code "which services handle guest check-in?" and verify it calls workspace_search | Agent log shows MCP tool call + relevant results |
| B.4 | Tune top-k and chunk size based on retrieval quality | Manual review of 10 test queries |
Milestone: Roo Code can autonomously search the workspace vector DB during any task.
Phase C: Kong AI Gateway (Day 3-4)¶
| Step | Task | Validation |
|---|---|---|
| C.1 | Add Kong to docker-compose.yml | docker compose up kong starts successfully |
| C.2 | Create config/kong/kong.yml with embedding routes | curl http://localhost:8001/services returns configured services |
| C.3 | Update embedder.py to route through Kong (http://localhost:8000/ai/embeddings) | Embeddings still work; Kong access log shows requests |
| C.4 | Add AI Prompt Decorator plugin for agent steering | LLM calls via Kong include injected system prompt |
| C.5 | Add Rate Limiting plugin | Verify 429 response when exceeding limit |
| C.6 | Add cost tracking (AI Observability or custom logging plugin) | Kong logs show token counts and estimated cost per request |
| C.7 | Configure Ollama as local fallback | When OpenAI key removed, embeddings still work via Ollama route |
Milestone: All AI API traffic flows through Kong with observability, cost tracking, and rate limiting.
Phase D: File Watching + Polish (Day 4-5)¶
| Step | Task | Validation |
|---|---|---|
| D.1 | Implement watcher.py with watchdog | Modify a YAML file, verify ChromaDB re-indexes within 5 seconds |
| D.2 | Add Makefile targets for common operations | make vector-index, make vector-watch, make vector-stats |
| D.3 | Write scripts/vector-db/README.md with setup and usage instructions | New team member can set up from scratch following README |
| D.4 | Add unit tests for chunker edge cases (empty files, malformed YAML, deeply nested Markdown) | All tests pass |
| D.5 | Performance test: time full re-index, measure query latency | Full index < 60s, query latency < 500ms |
Milestone: Production-ready system with automatic re-indexing and developer documentation.
Phase E: Optimization (Day 5-6, optional)¶
| Step | Task | Validation |
|---|---|---|
| E.1 | Add hybrid search (vector + BM25 keyword matching) | Structural queries ("services calling svc-check-in") return better results |
| E.2 | Add chunk overlap (50-token overlap between adjacent chunks) | Boundary-spanning concepts are not lost |
| E.3 | Add file-type boosting (weight OpenAPI specs higher for API queries) | API-related queries prioritize spec content |
| E.4 | Export Kong cost metrics to a dashboard | Weekly cost report for embedding + LLM calls |
| E.5 | Configure Continue.dev as alternative MCP client | Continue.dev can also query the same vector DB |
Makefile Additions¶
# ===========================================================================
# Vector Database (Workspace Search)
# ===========================================================================
vector-index: ## Full re-index of workspace into vector DB
python3 scripts/vector-db/index.py .
vector-watch: ## Watch workspace and re-index on file changes
python3 scripts/vector-db/watcher.py
vector-stats: ## Show vector DB statistics
python3 -c "from scripts.vector_db.store import get_stats; import json; print(json.dumps(get_stats(), indent=2))"
vector-search: ## Search vector DB: make vector-search Q="your query"
python3 -c "from scripts.vector_db.embedder import embed_query; from scripts.vector_db.store import query; import json; r=query(embed_query('$(Q)')); [print(f'{m[\"file_path\"]}:{m[\"line_start\"]} ({m[\"section_heading\"]})') for m in r['metadatas'][0]]"
kong-up: ## Start Kong AI Gateway
docker compose up kong -d
kong-logs: ## Tail Kong AI Gateway logs
docker compose logs kong -f
kong-routes: ## List Kong AI routes
curl -s http://localhost:8001/routes | python3 -m json.tool
Cost Projections¶
Initial Full Index¶
| Metric | OpenAI Embeddings | Local Ollama |
|---|---|---|
| Estimated chunks | ~3,000-5,000 | Same |
| Avg tokens per chunk | ~300 | Same |
| Total tokens | ~1,000,000-1,500,000 | Same |
| Embedding cost | $0.01-0.03 | $0.00 |
| Time (API) | 30-60 seconds | 2-5 minutes |
Daily Operations (Incremental)¶
| Metric | OpenAI Embeddings | Local Ollama |
|---|---|---|
| Files modified per day | ~20-50 | Same |
| Chunks re-indexed | ~100-300 | Same |
| Daily embedding cost | < $0.001 | $0.00 |
| Query cost per search | ~$0.000002 | $0.00 |
Kong AI Gateway Overhead¶
| Metric | Value |
|---|---|
| Docker image size | ~150 MB |
| Memory usage | ~128-256 MB |
| CPU overhead per request | < 1 ms (proxy latency) |
| Added latency per request | 2-5 ms |
Total Monthly Cost¶
| Configuration | Monthly Cost |
|---|---|
| OpenAI embeddings + Kong (local Docker) | ~$0.50 - $1.00 |
| Ollama local embeddings + Kong (local Docker) | $0.00 (compute only) |
| For comparison: GitHub Copilot (includes @workspace RAG) | $39.00/month |
Risk Register¶
| Risk | Likelihood | Impact | Mitigation |
|---|---|---|---|
| ChromaDB data corruption on crash | Low | Medium | .vector-db/ is ephemeral -- full re-index recovers in < 60 seconds |
| Stale embeddings return wrong context | Medium | Medium | File watcher for auto-reindex; vector-index Makefile target for manual rebuild |
| Kong AI Gateway adds complexity for solo architect | Medium | Low | Kong is optional -- embedder.py can call OpenAI directly by setting KONG_AI_URL="" |
| Chunking splits critical context across boundaries | Medium | Medium | 50-token overlap in Phase E; tune chunk boundaries for domain-specific patterns |
Roo Code ignores MCP tool (doesn't call workspace_search) | Low | High | MCP tool description explicitly instructs "use this BEFORE reading files"; add to Roo Code system prompt |
| OpenAI API rate limits during bulk re-index | Low | Low | Kong rate limiting prevents bursts; batch size of 50 stays well under limits |
| Embedding model version change alters vector space | Low | High | Re-index entire workspace when embedding model changes (< 60 seconds) |
Multi-Architect Deployment¶
Per-Architect Resource Model¶
Every component in this plan runs locally. Each architect who opens the workspace gets their own independent instance:
| Component | Per-architect? | Why |
|---|---|---|
ChromaDB (.vector-db/) | Yes -- local disk | ChromaDB runs as an embedded library, not a server. Each architect's checkout has its own .vector-db/ directory (gitignored). No shared state |
| File watcher | Yes -- local process | Each architect runs make vector-watch in their VS Code terminal. It watches their working copy for changes |
| Kong AI Gateway | Yes or shared | If running locally via Docker Compose (make kong-up), each architect runs their own. Could be shared via a team-hosted instance |
| MCP server | Yes -- local process | Roo Code spawns the MCP server as a child process (configured in .roo/mcp.json). It runs per VS Code window |
| Embeddings | Shared API key | All architects hit the same OpenAI/Ollama endpoint (via Kong or directly). Cost is pooled |
Practical Workflow Per Architect¶
Architect opens VS Code
-> Roo Code auto-starts MCP server (from .roo/mcp.json config)
-> MCP server connects to local ChromaDB
First time (or after git pull with many changes):
-> Run: make vector-index # ~60 seconds, full re-index
Ongoing:
-> Run: make vector-watch # background, re-indexes on save
-> (Or: VS Code task that auto-starts watcher on workspace open)
Scaling to Multiple Architects¶
The core limitation is that every architect maintains their own local vector DB. Three approaches address this:
Approach A: VS Code Task Auto-Start (Low Effort)¶
Add a .vscode/tasks.json task that auto-runs the watcher on workspace open. Each architect still has a local DB, but the watcher starts automatically with no manual step.
{
"version": "2.0.0",
"tasks": [
{
"label": "Vector DB: Watch for Changes",
"type": "shell",
"command": "python3",
"args": ["scripts/vector-db/watcher.py"],
"isBackground": true,
"problemMatcher": [],
"runOptions": {
"runOn": "folderOpen"
},
"presentation": {
"reveal": "silent",
"panel": "dedicated"
}
},
{
"label": "Vector DB: Full Re-index",
"type": "shell",
"command": "python3",
"args": ["scripts/vector-db/index.py", "."],
"problemMatcher": [],
"presentation": {
"reveal": "always"
}
}
]
}
Effort: 30 minutes. Trade-off: Still per-architect, still needs initial index after clone.
Approach B: Git Hook Indexing (Low Effort)¶
Add post-checkout and post-merge git hooks that trigger a full re-index after every git pull or branch switch. Combined with the file watcher for live changes.
#!/bin/sh
# .githooks/post-merge
# Auto-reindex vector DB after git pull
if [ -d "scripts/vector-db" ] && command -v python3 >/dev/null 2>&1; then
echo "Re-indexing workspace vector DB..."
python3 scripts/vector-db/index.py . &
fi
#!/bin/sh
# .githooks/post-checkout
# Auto-reindex vector DB after branch switch
# Only run for branch checkouts (flag=1), not file checkouts (flag=0)
if [ "$3" = "1" ] && [ -d "scripts/vector-db" ] && command -v python3 >/dev/null 2>&1; then
echo "Re-indexing workspace vector DB..."
python3 scripts/vector-db/index.py . &
fi
Configure git to use the hooks directory:
Effort: 1 hour. Trade-off: Adds ~60 seconds (background) to every pull. Still local per architect.
Approach C: Shared Qdrant Server (Medium Effort)¶
Replace embedded ChromaDB with a team-hosted Qdrant instance. All architects query the same index. A CI job re-indexes on every push to main.
# Addition to docker-compose.yml (or team-hosted VM)
qdrant:
image: qdrant/qdrant:v1.12
container_name: novatrek-qdrant
ports:
- "6333:6333" # REST API
- "6334:6334" # gRPC
volumes:
- qdrant-data:/qdrant/storage
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:6333/healthz"]
interval: 10s
timeout: 5s
retries: 5
CI job (GitHub Actions):
# .github/workflows/vector-index.yml
name: Reindex Vector DB
on:
push:
branches: [main]
paths:
- 'architecture/**'
- 'decisions/**'
- 'portal/docs/**'
- 'config/**'
- 'services/**'
jobs:
reindex:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: actions/setup-python@v5
with:
python-version: '3.12'
- run: pip install -r scripts/vector-db/requirements.txt
- run: python3 scripts/vector-db/index.py .
env:
QDRANT_URL: ${{ vars.QDRANT_URL }}
KONG_AI_URL: ${{ vars.KONG_AI_URL }}
OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
The store.py module would need a backend switch:
BACKEND = os.environ.get("VECTOR_BACKEND", "chromadb") # chromadb | qdrant
if BACKEND == "qdrant":
from qdrant_client import QdrantClient
client = QdrantClient(url=os.environ.get("QDRANT_URL", "http://localhost:6333"))
else:
import chromadb
client = chromadb.PersistentClient(path=".vector-db")
Effort: 1-2 days. Trade-off: Requires network access and shared infrastructure. Single source of truth -- no per-architect index staleness.
Multi-Architect Recommendation¶
| Scenario | Approach | Why |
|---|---|---|
| Solo architect (current state) | A + B | VS Code task auto-starts watcher; git hooks rebuild after pulls. Zero manual steps after initial setup |
| Team of 2-5 architects | A + B | Same as solo. Local DBs are acceptable when each architect's workspace diverges (feature branches) |
| Team of 5+ or CI-driven workflows | C | Shared Qdrant eliminates "every architect maintains their own DB" problem. CI-driven indexing guarantees freshness on main |
VS Code Extension Analysis¶
Could This Be a VS Code Extension?¶
Yes. VS Code's extension API provides every primitive needed:
| Requirement | VS Code Extension API |
|---|---|
| Watch file changes | vscode.workspace.onDidSaveTextDocument, vscode.workspace.onDidCreateFiles, vscode.workspace.onDidDeleteFiles |
| Read workspace files | vscode.workspace.fs.readFile, vscode.workspace.findFiles |
| Background processing | Extension activation on workspace open (onStartupFinished) |
| Status bar feedback | vscode.window.createStatusBarItem -- show "Indexed 3,412 chunks" |
| Configuration | package.json contributes settings -- embedding provider, Kong URL, chunk size |
| MCP server hosting | Extension can spawn the MCP server as a child process, or expose tools directly |
| Shared state across windows | globalState for cross-window persistence |
A VS Code extension would eliminate every manual step:
- No
make vector-index-- the extension auto-indexes on activation - No
make vector-watch-- file events are native to the extension lifecycle - No
.roo/mcp.jsonmanual config -- the extension registers the MCP server automatically - No separate terminal process -- everything runs inside the extension host
Should It Be a VS Code Extension?¶
For a solo architect or small team: No. For a distributable product: Yes.
Arguments FOR a VS Code Extension¶
| Advantage | Why it matters |
|---|---|
| Zero-touch setup | Install extension, open workspace, done. No pip install, no Docker, no Makefile targets, no background terminals |
| Native file watching | VS Code's file system events are more reliable than watchdog -- they fire for git operations, refactors, and external tools that modify files |
| UX integration | Status bar showing index health, progress notifications during re-index, command palette commands (>Workspace Search: Reindex, >Workspace Search: Query) |
| Per-workspace activation | Extension activates only for workspaces that need it (via activationEvents). No wasted resources |
| Portable | Any architect installs the extension from the marketplace (or a .vsix file). No Python environment, no requirements.txt compatibility issues |
| Lifecycle management | Extension deactivates cleanly when VS Code closes -- no orphaned watcher processes |
Arguments AGAINST a VS Code Extension¶
| Disadvantage | Why it matters more |
|---|---|
| Development effort is 3-5x higher | A VS Code extension requires TypeScript, webpack bundling, extension manifest, activation events, contribution points, state management. The Python scripts in this plan are ~400 lines total. An equivalent extension is ~1,500-2,500 lines of TypeScript + build config |
| Dependency bundling is painful | ChromaDB is a Python library. A VS Code extension runs in Node.js. Options: (a) bundle a ChromaDB Python subprocess, (b) use a JavaScript vector DB like vectra or hnswlib-node, (c) HTTP calls to a ChromaDB server in Docker. None are as clean as pip install chromadb |
| Embedding model integration | The extension would need to either bundle an embedding model (huge), call an external API (requires API key config in VS Code settings), or shell out to Python/Ollama. The Python script approach handles this natively |
| Testing and debugging | Extension debugging requires launching a separate VS Code Extension Development Host. Python scripts can be tested with pytest in 2 seconds |
| Maintenance burden | VS Code API changes between versions. Extension marketplace publishing has review requirements. Python scripts just work |
| Already solved by Continue.dev | Continue.dev already IS this VS Code extension -- open source, local codebase indexing, multiple LLM backends. Building a custom extension duplicates their work |
What a Custom Extension Gives You That Continue.dev Doesn't¶
| Requirement | Continue.dev | Custom Extension |
|---|---|---|
| Workspace-wide semantic search | Yes (@codebase) | Yes |
| Format-aware chunking (OpenAPI, YAML) | Partial -- generic chunking | Full control |
| Kong AI Gateway routing | No | Yes |
| Cost tracking per query | No | Yes (via Kong) |
| MCP tool exposure for Roo Code | No -- Continue.dev is its own chat | Yes |
| NovaTrek-specific metadata enrichment | No | Yes |
The only unique value a custom extension provides over Continue.dev is Kong AI integration and MCP tool exposure for Roo Code.
Recommended Hybrid Path: Thin Extension Wrapper¶
If extension-level UX is desired, build a thin VS Code extension wrapper around the existing Python scripts rather than rewriting everything in TypeScript:
VS Code Extension (TypeScript, ~200 lines)
|-- onStartupFinished -> spawn `python3 scripts/vector-db/index.py`
|-- onDidSaveTextDocument -> spawn `python3 scripts/vector-db/reindex-file.py <path>`
|-- Status bar item -> reads `.vector-db/stats.json`
|-- Command: "Reindex Workspace" -> spawns full `index.py`
+-- Extension settings -> Kong URL, embedding provider, top-k
Python scripts (unchanged from this plan)
|-- chunker.py, store.py, embedder.py -> actual work
|-- mcp_server.py -> Roo Code integration
+-- index.py, watcher.py -> invoked by extension
This gives:
- Extension UX (auto-start, status bar, command palette)
- Python implementation (ChromaDB native, easy to test, ~400 lines)
- No webpack/bundling complexity for the heavy logic
- Extension is a thin shell -- trivial to maintain
Extension Decision Matrix¶
| If you are... | Do this | Why |
|---|---|---|
| Solo architect wanting RAG now | Use the Python scripts from this plan | Working in days, not weeks. make vector-index && make vector-watch is 2 commands |
| Solo architect who wants polish | Install Continue.dev | Zero development. @codebase works out of the box |
| Building for a team of 3-5 | Python scripts + thin extension wrapper | Auto-start eliminates "forgot to run the watcher" failure mode. Kong routing gives cost visibility |
| Building a product for distribution | Full VS Code extension | Only if packaging for dozens of users who cannot be expected to run Python scripts |
Updated Implementation Phase (Phase F)¶
If the thin extension wrapper is pursued, add after Phase E:
Phase F: VS Code Extension Wrapper (Day 6-7, optional)¶
| Step | Task | Validation |
|---|---|---|
| F.1 | Scaffold VS Code extension with yo code generator | Extension loads in Extension Development Host |
| F.2 | Add onStartupFinished activation that spawns python3 scripts/vector-db/watcher.py as a child process | Opening workspace starts watcher automatically |
| F.3 | Add onDidSaveTextDocument handler that calls python3 scripts/vector-db/reindex-file.py <path> | Saving a file triggers re-index within 2 seconds |
| F.4 | Add status bar item that shows chunk count from .vector-db/stats.json | Status bar displays "Vector DB: 3,412 chunks" |
| F.5 | Add command palette: "Workspace Search: Full Reindex" | Command triggers index.py with progress notification |
| F.6 | Add extension settings for Kong URL and embedding provider | Settings appear under "Workspace Search" in VS Code settings |
| F.7 | Package as .vsix for team distribution | vsce package produces installable file |
Milestone: Zero-touch vector DB lifecycle -- opens with workspace, updates on save, no manual commands needed.
Success Criteria¶
| Criterion | Measurement |
|---|---|
| Full workspace indexed | vector-stats reports > 2,500 chunks |
| Query relevance | Top-3 results contain the answer for 80%+ of test queries |
| Query latency | < 500 ms end-to-end (embed query + search + format results) |
| Incremental re-index | Changed file re-indexed within 5 seconds of save |
| Kong observability | Every embedding and LLM call logged with token count and cost |
| Agent adoption | Roo Code calls workspace_search in > 50% of multi-file investigation tasks |
| Zero manual context | Architect does not need to manually paste file contents or explain workspace structure |
References¶
- Vector DB / RAG Feasibility Analysis -- Feasibility assessment and tool comparison
- Context Window Utilization Analysis -- Why better retrieval matters
- Deep Research: Copilot vs Kong+Roo Economics -- Token economics and RAG architecture comparison
- Kong AI Gateway Docs: https://docs.konghq.com/gateway/latest/ai-gateway/
- ChromaDB Docs: https://docs.trychroma.com/
- MCP Specification: https://spec.modelcontextprotocol.io/
- Ollama: https://ollama.ai/