ADR-001: AI Toolchain Selection for Architecture Practice

Status: PROPOSED
Date: 2026-03-01
Last Updated: 2026-03-04
Decision Makers: Christopher Blaisdell, Architecture Practice
Phase: Phase 1 - AI Tool Cost Comparison

Context and Problem Statement

The Architecture Practice needs AI-assisted tooling to accelerate solution architecture workflows — from ticket triage and investigation through solution design, review, and publishing. Two viable options exist. We need to select one as the standard toolchain for the practice, balancing cost, quality, standards compliance, and operational fit.

Which AI toolchain should the Architecture Practice adopt for AI-assisted solution architecture work?

Decision Drivers

  • Monthly cost per architect seat: The practice has multiple architects; per-seat cost must be defensible to leadership
  • Architecture output quality: AI-generated artifacts must meet arc42, C4, and MADR standards without excessive manual correction
  • Standards compliance: The toolchain must be configurable to enforce organizational architecture standards automatically
  • Workflow integration: The toolchain must integrate with the existing VS Code-based architecture workflow (DocFlow, PlantUML, Markdown)
  • Extensibility: The toolchain must support future pipeline integration (Phase 3) and custom tooling
  • Model flexibility: The ability to select and switch between LLM models as pricing and capabilities evolve
  • Corporate governance: The toolchain must operate within corporate security, procurement, and data handling policies

Considered Options

Option A: Roo Code + Kong AI Gateway

Description: Roo Code is a free, open-source VS Code extension that provides AI-assisted coding and architecture support through configurable modes and custom instructions. Kong AI Gateway routes LLM API requests through an enterprise API gateway to backend model providers (AWS Bedrock with Claude models).

Pricing model: Usage-based. Cost is determined by actual token consumption routed through Kong AI to AWS Bedrock. No per-seat software license fee.

Cost formula:

Monthly Cost Per Seat =
  (Monthly Input Tokens x Bedrock Input Token Price)
  + (Monthly Output Tokens x Bedrock Output Token Price)
  + Kong AI Gateway operational cost allocation (if any)
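As a sketch, the usage-based formula above can be expressed directly. The function name, token volumes, and prices below are illustrative placeholders (the Claude Sonnet list prices cited later in this ADR), not measured figures:

```python
def option_a_monthly_cost(input_tokens: int, output_tokens: int,
                          input_price_per_m: float, output_price_per_m: float,
                          gateway_allocation: float = 0.0) -> float:
    """Usage-based cost for tokens routed through Kong AI to Bedrock.

    Prices are per 1M tokens; gateway_allocation is any operational
    cost share for the Kong AI Gateway (may be zero).
    """
    return (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m
            + gateway_allocation)

# Illustrative only: 10M input + 2M output tokens at Sonnet list prices
# ($3.00/1M input, $15.00/1M output), no gateway allocation.
print(option_a_monthly_cost(10_000_000, 2_000_000, 3.00, 15.00))  # 60.0
```

The key property for the decision is that this cost floats with usage: a light month approaches zero, a heavy month has no ceiling.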

Strengths:

  • Cost scales with actual usage — light months cost less
  • Full model flexibility — can switch between Claude Sonnet, Haiku, Opus, or other providers
  • Custom instruction system (.roo/rules/) enables fine-grained standards enforcement per mode
  • Open source — no vendor lock-in on the extension
  • Kong AI Gateway already exists in the corporate infrastructure
  • Supports MCP (Model Context Protocol) for custom tool integration

Weaknesses:

  • Cost is unpredictable — heavy months could exceed flat-rate alternatives
  • Requires internal Kong AI team to maintain the gateway
  • No built-in ecosystem (no GitHub integration, no PR review, no code suggestions outside of chat)
  • Configuration complexity — custom instructions require ongoing maintenance

Option B: GitHub Copilot (Business or Enterprise)

Description: GitHub Copilot is a commercial AI assistant integrated into VS Code with chat, inline suggestions, agent mode, and extensions. Available at two tiers: Business ($19/seat/month) and Enterprise ($39/seat/month).

Pricing model: Flat per-seat monthly subscription. Includes a base allocation of premium model requests (Claude Sonnet, GPT-4o) with potential overage charges.

Cost formula:

Monthly Cost Per Seat =
  Subscription fee ($19 or $39 per seat)
  + Premium model request overage (if applicable)
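A minimal sketch of the flat-rate formula, assuming the $0.04 per-premium-request rate cited later in this ADR also applies to overage beyond the included allowance (the function name and example volumes are illustrative):

```python
def option_b_monthly_cost(seat_fee: float, premium_requests_used: int,
                          included_requests: int,
                          overage_rate: float = 0.04) -> float:
    """Flat per-seat subscription plus overage beyond the included
    premium-request allowance."""
    overage = max(0, premium_requests_used - included_requests) * overage_rate
    return seat_fee + overage

# Illustrative: a Pro+ seat ($39) consuming 1,600 premium requests
# against a 1,500-request allowance -> 100 overage requests at $0.04.
print(option_b_monthly_cost(39.00, 1600, 1500))  # 43.0
```

The key property here is the inverse of Option A: cost is flat until the allowance is exhausted, then grows slowly.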

Strengths:

  • Predictable monthly cost — easy to budget
  • Deep GitHub integration (PR reviews, code suggestions, repository context)
  • Large ecosystem (extensions, agent mode, workspace instructions via .github/copilot-instructions.md)
  • Minimal setup — works out of the box with VS Code
  • Enterprise tier includes organization-wide policy controls
  • Backed by Microsoft/GitHub — enterprise support and compliance

Weaknesses:

  • Per-seat cost applies regardless of usage volume — light users pay the same as heavy users
  • Model selection limited to what GitHub offers (currently Claude Sonnet, GPT-4o, and Copilot's own models)
  • Customization limited to workspace instructions and prompt engineering — no mode-based instruction system
  • Premium model request limits may require overage tracking
  • Vendor lock-in to the GitHub ecosystem

Evaluation Criteria

Phase 1 will evaluate both options across 5 representative architecture scenarios using the synthetic NovaTrek Adventures workspace. Each scenario will be scored on the following criteria:

| Criterion | Weight | Measurement Method |
| --- | --- | --- |
| Monthly cost per seat | 30% | Token usage extrapolated to monthly volume at current pricing |
| Architecture output quality | 25% | Architect-scored rubric (1-5) per scenario |
| Standards compliance rate | 20% | Pass/fail checklist against arc42, C4, MADR rules |
| Manual corrections required | 15% | Count of edits needed after AI generation |
| Workflow integration friction | 10% | Qualitative assessment of setup, configuration, and daily use |

Evaluation Scenarios

| Scenario | What It Tests |
| --- | --- |
| Ticket intake and classification | AI's ability to parse a ticket, classify architectural relevance, scaffold workspace |
| Current state investigation | AI's ability to analyze Swagger specs, source code, and logs to produce investigation docs |
| Solution design creation | AI's ability to produce arc42-compliant designs with impacts, decisions, and diagrams |
| Merge request review | AI's ability to validate spec/diagram changes against a solution design |
| Publishing preparation | AI's ability to validate cross-references, formatting, and standards compliance |

See phase-1-ai-tool-cost-comparison/AI-TOOL-COST-COMPARISON-PLAN.md for full scenario playbook details.

Decision Outcome

Selected option: TBD — pending Roo Code quality scoring for final comparison

GitHub Copilot (Claude Opus 4.6, Agent Mode) and Roo Code (Claude Opus 4.6, OpenRouter) Phase 1 executions are complete. Actual billing data has been collected. The table below is populated with actual costs and run 001 Copilot quality scores.

| Metric | Roo Code + OpenRouter | GitHub Copilot Business | GitHub Copilot Pro+ |
| --- | --- | --- | --- |
| Actual run cost (5 scenarios) | ~$100 (auto-top-up data, Mar 4) | N/A (not tested) | $0.48 (4 user prompts x 3x x $0.04) |
| Monthly run cost (38 runs) | ~$507 (extrapolated) | $19/seat (flat) | $39/seat (flat, 1500 req included) |
| Actual overage charged | $100 (pay-per-token) | N/A | $0 (within 1500 included requests) |
| Token/usage cost model | Per-token via OpenRouter | Flat subscription | Per user prompt x model multiplier x $0.04 |
| Platform cost | $0 (SaaS, no gateway) | $19/seat | $39/seat |
| Tool license cost | $0 (open source) | (included) | (included) |
| Total monthly per seat | ~$507 | $19 | $39 |
| Cost ratio vs cheapest | ~27x | 1x (baseline) | 2x |
| Per-run cost ratio (Copilot as baseline) | ~208x | N/A | 1x |
| SC-01 quality (/25) | TBD | N/A | 23 (92%) |
| SC-02 quality (/35) | TBD | N/A | 33 (94%) |
| SC-03 quality (/30) | TBD | N/A | 30 (100%) |
| SC-04 quality (/25) | TBD | N/A | 24 (96%) |
| SC-05 quality (/40) | TBD | N/A | 39 (98%) |
| Total quality (/155) | TBD | N/A | 149 (96.1%) |
| Cost per quality point (monthly) | ~$507/TBD = TBD | $19/TBD | $39/149 = $0.26 |
| Scenarios with quality >= 80% | TBD | N/A | 5/5 |
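As a sanity check on the cost-per-quality-point row, a minimal sketch using only the Copilot Pro+ figures already cited ($39/month, 149 of 155 quality points); the function name is illustrative:

```python
def cost_per_quality_point(monthly_cost: float, quality_score: int) -> float:
    """Monthly seat cost divided by total quality points earned."""
    return monthly_cost / quality_score

# Copilot Pro+: $39/month against a 149-point quality total.
print(round(cost_per_quality_point(39.00, 149), 2))  # 0.26
```

The Roo Code cell stays TBD until its quality scoring completes; the same function applies once that denominator is known.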

Revised Cost at Realistic Workload (Actual Billing Data, Deep Research Corrected)

The per-run cost estimates from the deep research used Claude Sonnet pricing ($3.00/1M input, $15.00/1M output), projecting ~$1.78/run. Actual billing data from run 002 (March 4, 2026) shows that Claude Opus 4.6 via OpenRouter consumed ~$100 in that single heavy run, and even the revised average of ~$13.35/run is roughly 7.5x higher than the Sonnet-based estimate.

Deep research on Copilot billing (DEEP-RESEARCH-RESULTS-COPILOT-BILLING.md) resolved that GitHub Copilot bills per user prompt, not per model turn. Autonomous tool calls are free. The correct session cost is $0.48 (4 user prompts x 3x multiplier x $0.04).
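The per-user-prompt billing model resolved by the deep research can be sketched as follows (the function name is illustrative; the multiplier and rate are the figures cited above):

```python
def copilot_session_cost(user_prompts: int, model_multiplier: float,
                         base_rate: float = 0.04) -> float:
    """Session cost under per-user-prompt billing: only user prompts
    are billed; autonomous tool calls in the agent loop are free."""
    return user_prompts * model_multiplier * base_rate

# Run 002 session: 4 user prompts at a 3x premium-model multiplier.
print(round(copilot_session_cost(4, 3), 2))  # 0.48
```

Note that the number of tool calls never appears in the formula, which is exactly why long autonomous sessions stay cheap under this model.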

| Metric | Roo Code + OpenRouter | Copilot Business | Copilot Pro+ |
| --- | --- | --- | --- |
| Actual per-run cost | ~$100 | ~$0.50 (est.) | $0.48 (4 prompts x 3 x $0.04) |
| Monthly runs (with PROMOTE) | ~38 | ~38 | ~38 |
| Monthly notional cost (38 runs) | ~$507 | $19.00 | $39.00 ($18.24 notional usage, within included allowance) |
| Cost per run | ~$13.35 avg | ~$0.50 | ~$0.48 |
| Runs per month within included allowance | N/A | N/A | ~125 (1,500 / 12 req per run) |
| Cost trend as volume grows | Increases linearly | Flat | Flat until ~125 runs/month |
| vs. Copilot Pro+ (per run) | ~208x more expensive | ~1x | 1x (baseline) |
| vs. Copilot Business (monthly) | ~27x more expensive | 1x (baseline) | 2x more expensive |

Billing Evidence (March 4, 2026)

OpenRouter (Roo Code): 4 auto-top-up charges of $25 each between 10:11 AM and 10:37 AM = $100 consumed in 26 minutes during run 002 execution.

GitHub Copilot Pro+: 120 premium requests at $0.04 each = $4.80 notional for the entire day across all projects. $0 overage (within 1,500 included monthly allowance).

The PROMOTE step (updating corporate baselines after deployment) adds ~12 runs/month to the workload. At this revised volume, Copilot Pro+ is ~208x cheaper per run than OpenRouter. Even at the monthly subscription level, Copilot is ~13x cheaper. The gap is so large that quality scores would need to be dramatically different (Roo Code achieving near-perfect scores while Copilot scored below 5%) for OpenRouter to be cost-effective on a per-quality-point basis.
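The headline ratios and the allowance ceiling in this section can be reproduced from the billing figures already quoted; a quick sketch (the ~12 billable requests per run is the assumption stated in the table above):

```python
# Per-run ratio: OpenRouter run 002 (~$100) vs Copilot Pro+ session ($0.48).
per_run_ratio = 100 / 0.48
print(round(per_run_ratio))         # 208

# Monthly ratio: ~$507 extrapolated OpenRouter vs a $39 Copilot Pro+ seat.
monthly_ratio = 507 / 39
print(round(monthly_ratio))         # 13

# Runs that fit inside the 1,500 included premium requests,
# assuming ~12 billable requests per run.
runs_within_allowance = 1500 // 12
print(runs_within_allowance)        # 125
```

At ~38 runs/month the practice would use well under a third of the Pro+ allowance, which is why the overage column stays at $0.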

Preliminary Observations (Copilot Completed, Roo Code Scoring Pending)

GitHub Copilot demonstrated:

  • 96.1% quality score across all 5 scenarios (149/155)
  • Autonomous multi-step execution — all scenarios completed in a single session
  • Correct architectural reasoning (data ownership violation identification in SC-03)
  • MADR-compliant ADR generation (9 ADRs created/formatted)
  • Valid PlantUML diagram generation (2 diagrams created/modified)
  • All 3 mock tools used appropriately across scenarios

Limitations:

  • No per-request token visibility — cost estimates are approximations
  • Context window management summarized early context during the long session
  • Fixed cost model means light months still cost $19/seat regardless of usage

Revised Cost Analysis (Actual Billing Data + Deep Research, 2026-03-04):

Actual billing data from run 002 execution, corrected by deep research findings:

  • OpenRouter (Claude Opus 4.6): ~$100/run (4 x $25 auto-top-ups in 26 minutes)
  • Copilot Pro+: $0.48 per run (4 user prompts x 3x multiplier x $0.04); $4.80 notional for the full day (120 premium requests across all projects); $0 overage
  • At 38 runs/month: OpenRouter = ~$507/month, Copilot Pro+ = $39/month (all 38 runs within the included 1,500 req allowance)
  • Copilot is ~208x cheaper per run ($0.48 vs ~$100)
  • Copilot is ~13x cheaper monthly ($39 vs ~$507)

The previous estimate of ~$2.80/run for Copilot came from spreading the daily total (120 requests) across runs under an assumed per-turn billing model. Deep research confirmed billing is per user prompt only — the entire autonomous tool-call loop is free. See DEEP-RESEARCH-RESULTS-COPILOT-BILLING.md for the full analysis with 39 cited sources.

Two critical risks with the OpenRouter stack remain from the deep research:

  1. Infinite retry loop: Context-length errors may trigger uncontrolled retries
  2. Rate limiting race condition: Post-response async token counting blocks context condensing

Positive Consequences

  • Copilot demonstrated production-ready architecture artifact generation (96.1% quality)
  • 9 MADR-formatted ADRs created across 5 scenarios — minimal manual correction expected
  • Autonomous multi-step execution reduces architect time investment per scenario
  • Flat-rate pricing provides budget predictability for practice leadership

Negative Consequences

  • No per-request token visibility limits cost optimization opportunities
  • Fixed cost regardless of usage — light months still cost $19-$39/seat
  • Model selection limited to what GitHub Copilot offers (currently Claude Opus 4.6 for agent mode)
  • Single-session context window management may degrade quality across very long sessions

Additional Considerations

Future Phase Impact

The selected toolchain will be used throughout all subsequent phases:

  • Phase 2: AI instructions and workflow design will be built for the selected tool
  • Phase 3: Pipeline integration may leverage tool-specific APIs (e.g., Roo Code MCP servers, Copilot extensions)
  • Phase 4: Artifact graph generation may use AI for relationship discovery
  • Phase 5: Continuous improvement metrics will be tool-specific

A toolchain switch after Phase 2 would require significant rework. This makes Phase 1's evaluation critical.

Hybrid Approach

It is possible that the evaluation reveals complementary strengths (e.g., Copilot for code-level suggestions, Roo Code + Kong AI for architecture-level generation). If the data supports it, a hybrid recommendation may be appropriate, though this increases operational complexity.