Phase 1: AI Tool Cost Comparison — GitHub Copilot Execution Results

Execution Summary

| Field | Value |
| --- | --- |
| Tool Under Test | GitHub Copilot (Claude Opus 4.6 via Agent Mode) |
| Tier | Copilot Business ($19/seat/month) |
| Execution Date | 2026-03-XX |
| Executor | AI-assisted (GitHub Copilot Agent Mode) |
| Workspace | NovaTrek Adventures Synthetic Architecture Workspace |
| Scenarios Completed | 5 / 5 |

Scenario Results

SC-01: New Ticket Triage (NTK-10005)

| Metric | Value |
| --- | --- |
| Ticket | NTK-10005: Wristband RFID Tag Field |
| Complexity | Simple |
| Duration | ~10 minutes |
| Files Read | 3 (ticket report, svc-check-in.yaml, mock-jira output) |
| Files Created/Updated | 8 (solution design scaffold, classification, assumptions, decisions, guidance, impacts, user stories, investigations) |
| Tool Calls | ~12 (mock-jira, file reads, file creates/edits) |
| Mock Tools Used | JIRA (--ticket, --list --status) |

Quality Scoring (/25)

| Criterion | Score | Notes |
| --- | --- | --- |
| Ticket Classification | 5 | Correctly classified as Simple: single-service schema change |
| Workspace Scaffolding | 4 | Created full folder structure with all expected artifacts |
| Swagger Awareness | 5 | Correctly identified existing rfid_tag field in WristbandAssignment schema |
| Recommendation Quality | 4 | Appropriate recommendations for simple additive change |
| Time Efficiency | 5 | Completed well within 15-minute target |
| Total | 23/25 | |

SC-02: Solution Design (NTK-10002)

| Metric | Value |
| --- | --- |
| Ticket | NTK-10002: Adventure Category Classification |
| Complexity | Medium |
| Duration | ~20 minutes |
| Files Read | 12+ (ticket report, solution design, 2 swagger specs, source code files, all sub-artifacts) |
| Files Updated | 1 (decisions.md, converted to full MADR format) |
| Tool Calls | ~18 (file reads, swagger spec analysis, mock-jira) |
| Mock Tools Used | JIRA |

Quality Scoring (/35)

| Criterion | Score | Notes |
| --- | --- | --- |
| Requirements Understanding | 5 | Correctly identified 25 categories → 3 patterns mapping with booking source overrides |
| Swagger Analysis | 5 | Read both svc-check-in.yaml and svc-trip-catalog.yaml; identified relevant schemas |
| Source Code Analysis | 5 | Analyzed AdventureCategoryClassifier.java, CheckInService.java, CheckInRecord.java |
| ADR Quality (MADR) | 5 | Converted 2 ADRs to full MADR format with options analysis, consequences, and pros/cons |
| Impact Assessment | 4 | Verified existing impacts for svc-check-in (PRIMARY) and svc-trip-catalog (MINOR) |
| User Stories | 4 | Verified that the 4 existing user stories cover operator, developer, and testing perspectives |
| Standards Compliance | 5 | MADR format applied correctly; arc42 quality attributes considered |
| Total | 33/35 | |

SC-03: Investigation & Root Cause Analysis (NTK-10004)

| Metric | Value |
| --- | --- |
| Ticket | NTK-10004: Guide Schedule Overwrite Bug |
| Complexity | High |
| Duration | ~30 minutes |
| Files Read | 8 (ticket report, SchedulingService.java, DailySchedule.java, ScheduleController.java, ConflictDetector.java, existing investigations.md, assumptions.md, simple.explanation.md) |
| Files Created/Updated | 7 (investigations.md rewritten, solution design, decisions with 2 ADRs, impacts, risks, user stories, guidance) |
| Tool Calls | ~25 (3 mock tools, file reads, file creates/edits) |
| Mock Tools Used | JIRA, Elastic (ERROR + WARN), GitLab |

Quality Scoring (/30)

| Criterion | Score | Notes |
| --- | --- | --- |
| Tool Usage | 5 | Used all 3 mock tools: JIRA (ticket), Elastic (ERROR + WARN logs), GitLab (MR list) |
| Root Cause Identification | 5 | Correctly identified PUT vs PATCH as primary root cause from SchedulingService.java source code |
| Data Ownership (Architectural) | 5 | Elevated from code bug to architectural boundary violation; identified that the orchestrator overwrites fields owned by guide-management |
| Remediation Quality | 5 | Proposed 3-phase fix: PATCH semantics (Sprint 19), @Version optimistic locking (Sprint 20), monitoring + governance (Sprint 21) |
| Document Structure | 5 | Investigation document follows proper structure with evidence chain, code analysis, root cause, and recommendations |
| Evidence-Based | 5 | Diagnosis supported by 4 ERROR logs (specific guide IDs, trace IDs, 47ms race window), 2 WARN logs (causal chain confirmed), source code annotations, MR absence |
| Total | 30/30 | |

SC-04: Architecture Update (NTK-10001)

| Metric | Value |
| --- | --- |
| Ticket | NTK-10001: Add Elevation Profile Data |
| Complexity | Medium |
| Duration | ~15 minutes |
| Files Read | 3 (solution design, svc-trail-management.yaml, novatrek-component-overview.puml) |
| Files Modified | 2 (svc-trail-management.yaml, novatrek-component-overview.puml) |
| Files Created | 1 (commit-message.md) |
| Tool Calls | ~10 |
| Mock Tools Used | None (artifact update scenario) |

Quality Scoring (/25)

| Criterion | Score | Notes |
| --- | --- | --- |
| OpenAPI Validity | 5 | Added elevation_loss_m with proper type (number/double), nullable, description, example |
| Field Quality | 5 | Descriptions explain semantics; examples provided; consistent with existing elevation_gain_m |
| PlantUML Syntax | 4 | Valid PlantUML syntax; added note annotation and updated dependency label |
| Design Consistency | 5 | Changes match exactly what the solution design specified: additive only, nullable fields |
| Commit Message | 5 | Conventional commit format, references NTK-10001, lists all changed files, notes backward compatibility |
| Total | 24/25 | |

SC-05: Cross-Service

| Metric | Value |
| --- | --- |
| Complexity | Very High |
| Duration | ~25 minutes |
| Files Read | 14+ (ticket report, solution design, 4 impact docs, risks, user stories, decisions, sequence diagram, guidance, assumptions, swagger specs) |
| Files Updated | 1 (decisions.md, converted 4 ADRs to full MADR format) |
| Files Created | 1 (C4 component diagram) |
| Tool Calls | ~20 |
| Mock Tools Used | None (design review/enhancement scenario) |

Quality Scoring (/40)

| Criterion | Score | Notes |
| --- | --- | --- |
| Service Discovery | 5 | Identified all 6 affected services: svc-check-in, svc-reservations, svc-guest-profiles, svc-safety-compliance, svc-gear-inventory, svc-partner-integrations |
| API Design | 5 | Verified complete POST /check-ins/lookup-reservation endpoint with request/response schemas, error responses, rate limiting |
| Diagram Validity | 4 | Created C4 component diagram (PlantUML); sequence diagram already existed and is comprehensive |
| ADR Quality | 5 | Converted 4 ADRs to full MADR format: orchestrator pattern, 4-field verification, temporary profiles, session expiry. Each has genuine options analysis |
| Impact Precision | 5 | 4 impact docs correctly scoped: svc-check-in PRIMARY (new endpoint, 5 clients, config), svc-guest-profiles MODERATE (new endpoint, profile type, merge), svc-safety-compliance LOW (extended query), svc-reservations MODERATE (new endpoint, composite index) |
| Risk Realism | 5 | 5 realistic risks with actionable mitigations (enumeration attacks, partner data inconsistency, profile accumulation, kiosk hardware, staff training) |
| Story Coverage | 5 | 5 user stories covering guest (US-1, US-3), partner-booked guest (US-2), security (US-4), and operations (US-5) |
| Security Awareness | 5 | Security front and center: PII masking, rate limiting (gateway + app), JWT scoping to device, artificial delays, audit logging |
| Total | 39/40 | |

Aggregate Quality Summary

| Scenario | Max Score | Achieved | Percentage |
| --- | --- | --- | --- |
| SC-01: Ticket Triage | 25 | 23 | 92% |
| SC-02: Solution Design | 35 | 33 | 94% |
| SC-03: Investigation | 30 | 30 | 100% |
| SC-04: Architecture Update | 25 | 24 | 96% |
| SC-05: Cross-Service | 40 | 39 | 98% |
| Total | 155 | 149 | 96.1% |

Normalized Quality Score

Average quality across 5 scenarios: 4.81 / 5.0

Scenarios with quality >= 80%: 5 / 5
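The normalized score can be reproduced from the summary table. The sketch below assumes the 4.81 figure pools points across all scenarios (149 of 155) and rescales to a 5-point scale; the report does not state its formula, so treat that as an assumption. The function names are illustrative only.

```python
# (max score, achieved score) per scenario, copied from the summary table.
SCORES = {
    "SC-01": (25, 23),
    "SC-02": (35, 33),
    "SC-03": (30, 30),
    "SC-04": (25, 24),
    "SC-05": (40, 39),
}


def percentage(max_score, achieved):
    """Achieved points as a percentage of the scenario maximum."""
    return 100.0 * achieved / max_score


def normalized_quality(scores, scale=5.0):
    """Pool all points, then rescale to the given scale (assumed method)."""
    total_max = sum(m for m, _ in scores.values())
    total_achieved = sum(a for _, a in scores.values())
    return scale * total_achieved / total_max


def scenarios_at_or_above(scores, threshold_pct=80.0):
    """Count scenarios whose percentage meets the threshold."""
    return sum(1 for m, a in scores.values() if percentage(m, a) >= threshold_pct)


if __name__ == "__main__":
    print(f"normalized: {normalized_quality(SCORES):.2f} / 5.0")
    print(f"scenarios >= 80%: {scenarios_at_or_above(SCORES)} / {len(SCORES)}")
```

Pooling points weights the larger rubrics (SC-05 at 40 points) more heavily than averaging the five percentages would; the two methods differ only in the third significant digit here.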


Observable Interaction Metrics (Copilot — No Token Counts Available)

Since GitHub Copilot does not expose per-request token counts, the following observable metrics are recorded:

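| Metric | SC-01 | SC-02 | SC-03 | SC-04 | SC-05 | Total |
| --- | --- | --- | --- | --- | --- | --- |
| Chat turns | 1 | 1 | 1 | 1 | 1 | 5 |
| Tool calls (est.) | 12 | 18 | 25 | 10 | 20 | 85 |
| Mock scripts executed | 2 | 1 | 4 | 0 | 0 | 7 |
| Files read | 3 | 12 | 8 | 3 | 14 | 40 |
| Files created | 8 | 0 | 6 | 1 | 1 | 16 |
| Files modified | 0 | 1 | 1 | 2 | 1 | 5 |
| Diagrams created/modified | 0 | 0 | 0 | 1 | 1 | 2 |
| ADRs created/formatted | 1 | 2 | 2 | 0 | 4 | 9 |
| Wall-clock time (est. min) | 10 | 20 | 30 | 15 | 25 | 100 |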

Cost Analysis

GitHub Copilot Cost (Fixed Model)

| Tier | Monthly Cost Per Seat | Annual Cost Per Seat |
| --- | --- | --- |
| Business | $19 | $228 |
| Enterprise | $39 | $468 |

Cost is fixed regardless of usage volume. No token-based overage observed during this test (all scenarios executed within the model's standard allocation — Claude Opus 4.6 fast mode).

Estimated Token Usage (for comparison with Kong AI)

Based on observable interactions and typical context window utilization:

| Metric | Estimate | Basis |
| --- | --- | --- |
| Average context per scenario | ~50,000-80,000 tokens | Files read (40 total, avg ~200 lines each at ~4 tokens/line) + system prompt + conversation |
| Average output per scenario | ~15,000-25,000 tokens | Files created/modified (21 total, avg ~80-150 lines each) |
| Total input tokens (5 scenarios) | ~300,000-400,000 | Conservative estimate |
| Total output tokens (5 scenarios) | ~75,000-125,000 | Conservative estimate |

Kong AI Equivalent Cost Estimate

If these 5 scenarios were executed via Kong AI + Bedrock with Claude Sonnet pricing, the cost must account for the agentic re-transmission tax: Roo Code's client-side architecture re-transmits the entire conversation history at every turn of the agentic loop. With 85 tool calls across 5 scenarios, the cumulative re-transmitted input volume is ~4M tokens. See DEEP-RESEARCH-1 and DEEP-RESEARCH-2.

| Scenario | Tool Calls | Cumulative Input Tokens | Output Tokens | Variable Cost |
| --- | --- | --- | --- | --- |
| SC-01 | 12 | ~300K | ~10K | $1.05 |
| SC-02 | 18 | ~810K | ~15K | $2.66 |
| SC-03 | 25 | ~1,625K | ~30K | $5.33 |
| SC-04 | 10 | ~220K | ~8K | $0.78 |
| SC-05 | 20 | ~1,100K | ~20K | $3.60 |
| TOTAL | 85 | ~4,055K | ~83K | $13.42 |
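The variable costs in this table can be recomputed from the token volumes. The report names "Claude Sonnet pricing" but does not state the rates; the sketch below assumes $3.00 per million input tokens and $15.00 per million output tokens, which happen to reproduce every row. Both rates and all function names are assumptions made for illustration.

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed per-million-token rates in USD; not stated in the report.
INPUT_RATE_PER_M = Decimal("3.00")
OUTPUT_RATE_PER_M = Decimal("15.00")

# scenario -> (cumulative input tokens, output tokens), from the table above.
USAGE = {
    "SC-01": (300_000, 10_000),
    "SC-02": (810_000, 15_000),
    "SC-03": (1_625_000, 30_000),
    "SC-04": (220_000, 8_000),
    "SC-05": (1_100_000, 20_000),
}


def variable_cost(input_tokens, output_tokens):
    """Token cost in USD, rounded half-up to whole cents."""
    raw = (input_tokens * INPUT_RATE_PER_M
           + output_tokens * OUTPUT_RATE_PER_M) / Decimal(1_000_000)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def cost_table(usage):
    """Per-scenario cost, keyed by scenario name."""
    return {name: variable_cost(i, o) for name, (i, o) in usage.items()}


if __name__ == "__main__":
    costs = cost_table(USAGE)
    for name, cost in costs.items():
        print(f"{name}: ${cost}")
    print(f"TOTAL: ${sum(costs.values())}")
```

Decimal arithmetic is used so that half-cent values such as 2.655 round up deterministically. The table total is the sum of the rounded rows.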

Monthly Cost Projection

| Scenario | Monthly Freq | Copilot Business | Copilot Enterprise | Kong AI |
| --- | --- | --- | --- | --- |
| SC-01: Ticket Triage | 10 | | | $10.50 |
| SC-02: Solution Design | 6 | | | $15.96 |
| SC-03: Investigation | 4 | | | $21.32 |
| SC-04: Architecture Update | 4 | | | $3.12 |
| SC-05: Cross-Service | 2 | | | $7.20 |
| Total Monthly | 26 runs | $19.00 | $39.00 | ~$58.10 |
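Each Kong AI figure in the projection is the per-run variable cost multiplied by the monthly frequency. A minimal sketch of that arithmetic, using the per-run costs and frequencies from the tables; the variable and function names are illustrative.

```python
from decimal import Decimal

# Per-run Kong AI variable cost (USD) and monthly run frequency per scenario.
PER_RUN_COST = {
    "SC-01": Decimal("1.05"),
    "SC-02": Decimal("2.66"),
    "SC-03": Decimal("5.33"),
    "SC-04": Decimal("0.78"),
    "SC-05": Decimal("3.60"),
}
MONTHLY_FREQ = {"SC-01": 10, "SC-02": 6, "SC-03": 4, "SC-04": 4, "SC-05": 2}


def monthly_variable_cost(per_run, freq):
    """Monthly cost per scenario: per-run cost times runs per month."""
    return {name: per_run[name] * runs for name, runs in freq.items()}


if __name__ == "__main__":
    monthly = monthly_variable_cost(PER_RUN_COST, MONTHLY_FREQ)
    for name, cost in monthly.items():
        print(f"{name}: ${cost}")
    print(f"runs: {sum(MONTHLY_FREQ.values())}, total: ${sum(monthly.values())}")
```

The Copilot columns have no per-scenario entries because the seat fee does not vary with usage.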

Cost Per Quality Point

| Tool | Monthly Cost (26 runs) | Quality Score | Cost per Quality Point |
| --- | --- | --- | --- |
| Kong AI | $58.10 | TBD (not yet tested) | TBD |
| Copilot Business | $19.00 | 4.81/5.0 | $3.95 |
| Copilot Enterprise | $39.00 | 4.81/5.0 | $8.11 |

Note: Even if Kong AI matches Copilot quality, its cost per quality point would be ~$12.08 — 3× higher than Copilot Business.
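Cost per quality point is the monthly cost divided by the normalized quality score. The sketch below reproduces the figures above, including the hypothetical case from the note in which Kong AI is assumed to match Copilot's 4.81; Kong AI has not been tested, so that value is a placeholder.

```python
def cost_per_quality_point(monthly_cost, quality_score):
    """Monthly cost in USD divided by the normalized quality score (0-5)."""
    if quality_score <= 0:
        raise ValueError("quality_score must be positive")
    return monthly_cost / quality_score


if __name__ == "__main__":
    quality = 4.81  # measured for Copilot; only assumed for Kong AI
    business = cost_per_quality_point(19.00, quality)
    enterprise = cost_per_quality_point(39.00, quality)
    kong = cost_per_quality_point(58.10, quality)
    print(f"Copilot Business:   ${business:.2f}")
    print(f"Copilot Enterprise: ${enterprise:.2f}")
    print(f"Kong AI (assumed):  ${kong:.2f}, {kong / business:.1f}x Business")
```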

Scalability Projection

| Usage Level | Monthly Runs | Copilot Business | Copilot Enterprise | Kong AI |
| --- | --- | --- | --- | --- |
| 1x (base) | 26 | $19.00 | $39.00 | $58.10 |
| 1x + PROMOTE | 38 | $19.00 | $39.00 | $67.46 |
| 2x | 52 | $19.00 | $39.00 | $116.20 |
| 3x | 78 | $19.00 | $39.00 | $174.30 |
| Breakeven vs. Copilot Business | ~9 runs | $19.00 | | ~$19.00 |
| Breakeven vs. Copilot Enterprise | ~17 runs | | $39.00 | ~$39.00 |

The 2x and 3x rows scale the $58.10 base linearly with run count. The breakeven rows use the base-mix average of about $2.23 per run ($58.10 / 26).

Key finding: Copilot Business is 3.5× cheaper than Kong AI at the realistic 38 runs/month workload ($19 vs $67). This advantage grows with volume because the Copilot fee is flat while Kong AI cost scales with runs: at 3× workload, Copilot is 9.2× cheaper ($19 vs $174). The dominant cost driver for Kong AI is the agentic re-transmission tax: cumulative re-transmission of the conversation history across 85+ turns per batch. See COST-MEASUREMENT-METHODOLOGY.md for the full analysis and DEEP-RESEARCH-1.md for the underlying token economics research.
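The re-transmission effect can be illustrated with a toy model: if every turn re-sends the whole history and each turn adds a fixed number of tokens to it, billed input grows roughly with the square of the turn count. The parameter values below are hypothetical and are not measurements from this test.

```python
def cumulative_input_tokens(turns, base_context, growth_per_turn):
    """Total input tokens billed when the full history is re-sent each turn.

    Turn 1 sends base_context; every later turn sends the previous
    context plus growth_per_turn additional tokens.
    """
    total = 0
    context = base_context
    for _ in range(turns):
        total += context
        context += growth_per_turn
    return total


def cumulative_input_closed_form(turns, base_context, growth_per_turn):
    """Same quantity as an arithmetic series: n*b + g*n*(n-1)/2."""
    return turns * base_context + growth_per_turn * turns * (turns - 1) // 2


if __name__ == "__main__":
    # Hypothetical values: 10K-token starting context, 2K tokens added per turn.
    for n in (10, 20, 40):
        print(n, cumulative_input_tokens(n, 10_000, 2_000))
```

Under these assumptions, doubling the number of turns more than doubles the billed input, which is why scenarios with many tool calls dominate the variable cost.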


Observations and Notes

Strengths Demonstrated

  1. Autonomous multi-step execution: Completed all 5 scenarios in a single continuous session without requiring user intervention between scenarios
  2. Multi-tool orchestration: Used mock JIRA, Elastic, and GitLab tools appropriately across scenarios
  3. Root cause elevation: Identified NTK-10004 as an architectural boundary violation, not just a code bug (the key insight the playbook tests for)
  4. Standards compliance: All ADRs formatted to MADR template; PlantUML syntax is valid; solution designs follow arc42 structure
  5. Scope discipline: NTK-10001 changes were limited to what the solution design specified (no scope creep)

Limitations Observed

  1. No per-request token visibility: Cannot produce exact token costs — estimates only
  2. Single-session execution: All 5 scenarios ran in one continuous conversation, which may inflate context usage compared to isolated runs
  3. Pre-existing artifacts: Several scenario artifacts already existed in the workspace (by design); the AI correctly identified and enhanced them rather than creating duplicates
  4. Context window management: As the session progressed across 5 scenarios, earlier context was summarized — later scenarios had less access to early scenario details

Quality Notes

  • SC-01 deductions (-2): one point on Workspace Scaffolding (created a flat file structure rather than strictly following the folder convention in some areas) and one point on Recommendation Quality
  • SC-04 deduction (-1): PlantUML diagram update used a note annotation rather than a full structural change; functional but could be more integrated
  • SC-05 deduction (-1): C4 component diagram was created as a new file rather than updating the existing system context — valid approach but could also have updated the system-level diagram