Phase 1: AI Tool Cost Comparison — GitHub Copilot Execution Results

Execution Summary

Field	Value
Tool Under Test	GitHub Copilot (Claude Opus 4.6 via Agent Mode)
Tier	Copilot Business ($19/seat/month)
Execution Date	2026-03-XX
Executor	AI-assisted (GitHub Copilot Agent Mode)
Workspace	NovaTrek Adventures Synthetic Architecture Workspace
Scenarios Completed	5 / 5

Scenario Results

SC-01: New Ticket Triage (NTK-10005)

Metric	Value
Ticket	NTK-10005 — Wristband RFID Tag Field
Complexity	Simple
Duration	~10 minutes
Files Read	3 (ticket report, svc-check-in.yaml, mock-jira output)
Files Created/Updated	8 (solution design scaffold, classification, assumptions, decisions, guidance, impacts, user stories, investigations)
Tool Calls	~12 (mock-jira, file reads, file creates/edits)
Mock Tools Used	JIRA (--ticket, --list --status)

Quality Scoring (/25)

Criterion	Score	Notes
Ticket Classification	5	Correctly classified as Simple — single-service schema change
Workspace Scaffolding	4	Created full folder structure with all expected artifacts
Swagger Awareness	5	Correctly identified existing rfid_tag field in WristbandAssignment schema
Recommendation Quality	4	Appropriate recommendations for simple additive change
Time Efficiency	5	Completed well within 15-minute target
Total	23/25

SC-02: Solution Design (NTK-10002)

Metric	Value
Ticket	NTK-10002 — Adventure Category Classification
Complexity	Medium
Duration	~20 minutes
Files Read	12+ (ticket report, solution design, 2 swagger specs, source code files, all sub-artifacts)
Files Updated	1 (decisions.md — converted to full MADR format)
Tool Calls	~18 (file reads, swagger spec analysis, mock-jira)
Mock Tools Used	JIRA

Quality Scoring (/35)

Criterion	Score	Notes
Requirements Understanding	5	Correctly identified 25 categories → 3 patterns mapping with booking source overrides
Swagger Analysis	5	Read both svc-check-in.yaml and svc-trip-catalog.yaml; identified relevant schemas
Source Code Analysis	5	Analyzed AdventureCategoryClassifier.java, CheckInService.java, CheckInRecord.java
ADR Quality (MADR)	5	Converted 2 ADRs to full MADR format with options analysis, consequences, and pros/cons
Impact Assessment	4	Verified existing impacts for svc-check-in (PRIMARY) and svc-trip-catalog (MINOR)
User Stories	4	Verified existing 4 user stories cover operator, developer, and testing perspectives
Standards Compliance	5	MADR format applied correctly; arc42 quality attributes considered
Total	33/35

SC-03: Investigation & Root Cause Analysis (NTK-10004)

Metric	Value
Ticket	NTK-10004 — Guide Schedule Overwrite Bug
Complexity	High
Duration	~30 minutes
Files Read	8 (ticket report, SchedulingService.java, DailySchedule.java, ScheduleController.java, ConflictDetector.java, existing investigations.md, assumptions.md, simple.explanation.md)
Files Created/Updated	7 (1 rewritten: investigations.md; 6 created: solution design, decisions with 2 ADRs, impacts, risks, user stories, guidance)
Tool Calls	~25 (3 mock tools, file reads, file creates/edits)
Mock Tools Used	JIRA, Elastic (ERROR + WARN), GitLab

Quality Scoring (/30)

Criterion	Score	Notes
Tool Usage	5	Used all 3 mock tools: JIRA (ticket), Elastic (ERROR + WARN logs), GitLab (MR list)
Root Cause Identification	5	Correctly identified PUT vs PATCH as primary root cause from SchedulingService.java source code
Data Ownership (Architectural)	5	Elevated from code bug to architectural boundary violation — identified that orchestrator overwrites fields owned by guide-management
Remediation Quality	5	Proposed 3-phase fix: PATCH semantics (Sprint 19), @Version optimistic locking (Sprint 20), monitoring + governance (Sprint 21)
Document Structure	5	Investigation document follows proper structure with evidence chain, code analysis, root cause, and recommendations
Evidence-Based	5	Diagnosis supported by 4 ERROR logs (specific guide IDs, trace IDs, 47ms race window), 2 WARN logs (causal chain confirmed), source code annotations, MR absence
Total	30/30
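
The root cause and the first two remediation phases in the table above can be illustrated with a minimal Python sketch. This is not the NovaTrek code (the services under test are Java); the record fields, `put_schedule`, `patch_schedule`, and `StaleVersionError` are hypothetical names chosen only to show the difference between full replacement and a version-checked partial update.

```python
from dataclasses import dataclass, replace


class StaleVersionError(Exception):
    """Raised when a writer's copy of the record is out of date."""


@dataclass(frozen=True)
class DailySchedule:
    # Hypothetical fields: one written by the orchestrator, one owned elsewhere.
    guide_id: str
    trip_assignments: tuple = ()   # written by the scheduling orchestrator
    guide_notes: str = ""          # owned by guide-management
    version: int = 0               # optimistic-lock counter (cf. JPA @Version)


def put_schedule(stored: DailySchedule, incoming: DailySchedule) -> DailySchedule:
    """PUT semantics: the incoming document replaces the stored one wholesale."""
    return replace(incoming, version=stored.version + 1)


def patch_schedule(stored: DailySchedule, expected_version: int, **changes) -> DailySchedule:
    """PATCH semantics plus a version check: only the named fields change."""
    if expected_version != stored.version:
        raise StaleVersionError(f"expected v{expected_version}, found v{stored.version}")
    return replace(stored, **changes, version=stored.version + 1)


stored = DailySchedule("G-1", guide_notes="certified for alpine routes", version=3)
stale_copy = DailySchedule("G-1", trip_assignments=("T-9",))  # lacks guide_notes

overwritten = put_schedule(stored, stale_copy)   # guide_notes silently reset to ""
patched = patch_schedule(stored, expected_version=3, trip_assignments=("T-9",))
```

With full replacement, the field owned by another service is lost; with the partial update it survives, and a writer holding an outdated version is rejected instead of overwriting newer data.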

SC-04: Architecture Update (NTK-10001)

Metric	Value
Ticket	NTK-10001 — Add Elevation Profile Data
Complexity	Medium
Duration	~15 minutes
Files Read	3 (solution design, svc-trail-management.yaml, novatrek-component-overview.puml)
Files Modified	2 (svc-trail-management.yaml, novatrek-component-overview.puml)
Files Created	1 (commit-message.md)
Tool Calls	~10
Mock Tools Used	None (artifact update scenario)

Quality Scoring (/25)

Criterion	Score	Notes
OpenAPI Validity	5	Added `elevation_loss_m` with proper type (number/double), nullable, description, example
Field Quality	5	Descriptions explain semantics; examples provided; consistent with existing `elevation_gain_m`
PlantUML Syntax	4	Valid PlantUML syntax; added note annotation and updated dependency label
Design Consistency	5	Changes match exactly what solution design specified — additive only, nullable fields
Commit Message	5	Conventional commit format, references NTK-10001, lists all changed files, notes backward compatibility
Total	24/25

SC-05: Cross-Service

Metric	Value
Complexity	Very High
Duration	~25 minutes
Files Read	14+ (ticket report, solution design, 4 impact docs, risks, user stories, decisions, sequence diagram, guidance, assumptions, swagger specs)
Files Updated	1 (decisions.md — converted 4 ADRs to full MADR format)
Files Created	1 (C4 component diagram)
Tool Calls	~20
Mock Tools Used	None (design review/enhancement scenario)

Quality Scoring (/40)

Criterion	Score	Notes
Service Discovery	5	Identified all 6 affected services: svc-check-in, svc-reservations, svc-guest-profiles, svc-safety-compliance, svc-gear-inventory, svc-partner-integrations
API Design	5	Verified complete POST /check-ins/lookup-reservation endpoint with request/response schemas, error responses, rate limiting
Diagram Validity	4	Created C4 component diagram (PlantUML); sequence diagram already existed and is comprehensive
ADR Quality	5	Converted 4 ADRs to full MADR format: orchestrator pattern, 4-field verification, temporary profiles, session expiry. Each has genuine options analysis
Impact Precision	5	4 impact docs correctly scoped: svc-check-in PRIMARY (new endpoint, 5 clients, config), svc-guest-profiles MODERATE (new endpoint, profile type, merge), svc-safety-compliance LOW (extended query), svc-reservations MODERATE (new endpoint, composite index)
Risk Realism	5	5 realistic risks with actionable mitigations (enumeration attacks, partner data inconsistency, profile accumulation, kiosk hardware, staff training)
Story Coverage	5	5 user stories covering guest (US-1, US-3), partner-booked guest (US-2), security (US-4), and operations (US-5)
Security Awareness	5	Security front and center: PII masking, rate limiting (gateway + app), JWT scoping to device, artificial delays, audit logging
Total	39/40

Aggregate Quality Summary

Scenario	Max Score	Achieved	Percentage
SC-01: Ticket Triage	25	23	92%
SC-02: Solution Design	35	33	94%
SC-03: Investigation	30	30	100%
SC-04: Architecture Update	25	24	96%
SC-05: Cross-Service	40	39	98%
Total	155	149	96.1%

Normalized Quality Score

Normalized quality across 5 scenarios: 4.81 / 5.0 (pooled score 149/155 rescaled to a 5-point scale)

Scenarios with quality >= 80%: 5 / 5
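
The aggregate and normalized figures above can be reproduced with a short calculation. This sketch assumes the 4.81 figure is the pooled percentage (149/155) rescaled to a 5-point scale, which matches the reported value; the variable names are illustrative.

```python
# (achieved, max) per scenario, copied from the aggregate table.
scores = {
    "SC-01": (23, 25),
    "SC-02": (33, 35),
    "SC-03": (30, 30),
    "SC-04": (24, 25),
    "SC-05": (39, 40),
}

achieved = sum(a for a, _ in scores.values())
maximum = sum(m for _, m in scores.values())
pooled_pct = 100 * achieved / maximum            # 149 / 155
normalized = round(5 * achieved / maximum, 2)    # rescale to a 5-point scale
passing = sum(1 for a, m in scores.values() if a / m >= 0.80)

print(f"{achieved}/{maximum} = {pooled_pct:.1f}%, normalized {normalized}/5.0, "
      f"{passing}/{len(scores)} scenarios at or above 80%")
```

Averaging the five per-scenario percentages instead of pooling the points gives about 96.0% (4.80/5.0), so the two conventions differ slightly.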

Observable Interaction Metrics (Copilot — No Token Counts Available)

Because GitHub Copilot does not expose per-request token counts, the following observable metrics were recorded instead:

Metric	SC-01	SC-02	SC-03	SC-04	SC-05	Total
Chat turns	1	1	1	1	1	5
Tool calls (est.)	12	18	25	10	20	85
Mock scripts executed	2	1	4	0	0	7
Files read	3	12	8	3	14	40
Files created	8	0	6	1	1	16
Files modified	0	1	1	2	1	5
Diagrams created/modified	0	0	0	1	1	2
ADRs created/formatted	1	2	2	0	4	9
Wall-clock time (est. min)	10	20	30	15	25	100

Cost Analysis

GitHub Copilot Cost (Fixed Model)

Tier	Monthly Cost Per Seat	Annual Cost Per Seat
Business	$19	$228
Enterprise	$39	$468

Cost is fixed regardless of usage volume. No token-based overage was observed during this test: all scenarios ran within the standard allocation for the model used (Claude Opus 4.6, fast mode).

Estimated Token Usage (for comparison with Kong AI)

Based on observable interactions and typical context window utilization:

Metric	Estimate	Basis
Average context per scenario	~50,000-80,000 tokens	Files read (40 total, avg ~200 lines each at ~4 tokens/line) + system prompt + conversation
Average output per scenario	~15,000-25,000 tokens	Files created/modified (21 total, avg ~80-150 lines each)
Total input tokens (5 scenarios)	~300,000-400,000	Conservative estimate
Total output tokens (5 scenarios)	~75,000-125,000	Conservative estimate

Kong AI Equivalent Cost Estimate

If these 5 scenarios were executed via Kong AI + Bedrock at Claude Sonnet pricing, the cost would have to account for the agentic re-transmission tax: Roo Code's client-side architecture re-transmits the entire conversation history at every turn of the agentic loop. With 85 tool calls across the 5 scenarios, the cumulative re-transmitted input volume is ~4.06M tokens (4,055K in the table below). See DEEP-RESEARCH-1 and DEEP-RESEARCH-2.
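
To see why re-transmission dominates, consider a simplified model in which every turn resends the full history. The sketch below assumes a fixed starting context and a fixed number of tokens appended per turn; both values are hypothetical and are not measurements from this test.

```python
def cumulative_input_tokens(turns: int, base_context: int, added_per_turn: int) -> int:
    """Total input tokens billed when each turn resends the whole history.

    Turn i (1-based) sends base_context + (i - 1) * added_per_turn tokens,
    so the total grows quadratically with the number of turns.
    """
    return sum(base_context + (i - 1) * added_per_turn for i in range(1, turns + 1))


# Hypothetical parameters: 10K-token starting context, 4K tokens added per turn.
for turns in (10, 20, 40):
    total = cumulative_input_tokens(turns, base_context=10_000, added_per_turn=4_000)
    print(f"{turns:>2} turns -> {total:,} cumulative input tokens")
```

Under these assumptions, doubling the turn count more than triples the cumulative input, which is why long agentic loops are disproportionately expensive under per-token billing.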

Scenario	Tool Calls	Cumulative Input Tokens	Output Tokens	Variable Cost
SC-01	12	~300K	~10K	$1.05
SC-02	18	~810K	~15K	$2.66
SC-03	25	~1,625K	~30K	$5.33
SC-04	10	~220K	~8K	$0.78
SC-05	20	~1,100K	~20K	$3.60
TOTAL	85	~4,055K	~83K	$13.42
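
The Variable Cost column is consistent with rates of $3 per million input tokens and $15 per million output tokens. Those rates are inferred from the table rather than stated in the source, so treat them as an assumption. A minimal sketch of the calculation:

```python
from decimal import Decimal, ROUND_HALF_UP

# Assumed rates (USD per million tokens), inferred from the table above.
INPUT_RATE = Decimal("3")
OUTPUT_RATE = Decimal("15")

# (cumulative input tokens, output tokens) in thousands, per scenario.
usage_k = {
    "SC-01": (300, 10),
    "SC-02": (810, 15),
    "SC-03": (1625, 30),
    "SC-04": (220, 8),
    "SC-05": (1100, 20),
}


def variable_cost(input_k: int, output_k: int) -> Decimal:
    """Cost in USD, rounded half-up to cents."""
    raw = (Decimal(input_k) * INPUT_RATE + Decimal(output_k) * OUTPUT_RATE) / Decimal(1000)
    return raw.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


costs = {name: variable_cost(*tokens) for name, tokens in usage_k.items()}
total = sum(costs.values())

for name, cost in costs.items():
    print(f"{name}: ${cost}")
print(f"TOTAL: ${total}")
```

Input tokens account for most of each scenario's cost in this calculation, because the cumulative input volume is far larger than the output volume.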

Monthly Cost Projection

Scenario	Monthly Freq	Copilot Business	Copilot Enterprise	Kong AI
SC-01: Ticket Triage	10	—	—	$10.50
SC-02: Solution Design	6	—	—	$15.96
SC-03: Investigation	4	—	—	$21.32
SC-04: Architecture Update	4	—	—	$3.12
SC-05: Cross-Service	2	—	—	$7.20
Total Monthly	26 runs	$19.00	$39.00	~$58.10
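
The Kong AI column is the per-run variable cost multiplied by the monthly frequency, while the Copilot columns stay flat at the seat price. A short sketch of that projection, using the figures from the tables above (the variable names are illustrative):

```python
# (per-run Kong AI cost in cents, monthly frequency), from the tables above.
# Amounts are kept in integer cents to avoid floating-point drift.
workload = {
    "SC-01": (105, 10),
    "SC-02": (266, 6),
    "SC-03": (533, 4),
    "SC-04": (78, 4),
    "SC-05": (360, 2),
}

monthly_cents = {name: cents * freq for name, (cents, freq) in workload.items()}
total_runs = sum(freq for _, freq in workload.values())
total_cents = sum(monthly_cents.values())

print(f"{total_runs} runs/month -> ${total_cents / 100:.2f}")
```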

Cost Per Quality Point

Tool	Monthly Cost (26 runs)	Quality Score	Cost per Quality Point
Kong AI	$58.10	TBD (not yet tested)	TBD
Copilot Business	$19.00	4.81/5.0	$3.95
Copilot Enterprise	$39.00	4.81/5.0	$8.11

Note: Even if Kong AI matched Copilot's quality score (4.81/5.0), its cost per quality point would be ~$12.08 ($58.10 / 4.81), about 3× that of Copilot Business ($3.95).
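
Cost per quality point here is the monthly cost divided by the normalized 5-point quality score. The sketch below reproduces the table values; the Kong AI figure assumes, hypothetically, that Kong AI would match the Copilot score, since it has not yet been tested.

```python
def cost_per_quality_point(monthly_cost: float, quality_score: float) -> float:
    """Monthly cost (USD) divided by the normalized 5-point quality score."""
    return round(monthly_cost / quality_score, 2)


QUALITY = 4.81  # normalized Copilot score; assumed equal for Kong AI

business = cost_per_quality_point(19.00, QUALITY)
enterprise = cost_per_quality_point(39.00, QUALITY)
kong_if_equal = cost_per_quality_point(58.10, QUALITY)

print(business, enterprise, kong_if_equal)
print(f"Kong AI / Copilot Business: {kong_if_equal / business:.1f}x")
```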

Scalability Projection

Usage Level	Monthly Runs	Copilot Business	Copilot Enterprise	Kong AI
1x (base)	26	$19.00	$39.00	$58.10
1x + PROMOTE	38	$19.00	$39.00	$67.46
2x	52	$19.00	$39.00	$92.32
3x	78	$19.00	$39.00	$138.48
Breakeven vs. Copilot Business	~11 runs	$19.00	—	~$19.00
Breakeven vs. Copilot Enterprise	~22 runs	—	$39.00	~$39.00

Key finding: Copilot Business is about 3.5× cheaper than Kong AI at the realistic workload of 38 runs/month ($19.00 vs $67.46). The advantage grows with volume: at the 3x workload (78 runs/month), Copilot Business is about 7.3× cheaper ($19.00 vs $138.48). The dominant cost driver for Kong AI is the agentic re-transmission tax, that is, cumulative re-transmission of the conversation history across 85+ turns per batch. See COST-MEASUREMENT-METHODOLOGY.md for the full analysis and DEEP-RESEARCH-1.md for the underlying token economics research.
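
The 2x, 3x, and breakeven rows of the scalability table are consistent with applying the blended per-run cost of the 38-run workload ($67.46 / 38, about $1.78 per run) to other run counts. This reading is inferred from the numbers and is not stated in the source; results can differ from the table by a cent because of rounding.

```python
BLENDED_COST_PER_RUN = 67.46 / 38   # USD per run, inferred from the 38-run row


def kong_monthly_cost(runs: int) -> float:
    """Projected Kong AI cost (USD) for a month with the given number of runs."""
    return runs * BLENDED_COST_PER_RUN


def breakeven_runs(seat_price: float) -> float:
    """Runs per month at which Kong AI cost equals a flat seat price."""
    return seat_price / BLENDED_COST_PER_RUN


for runs in (52, 78):
    cost = kong_monthly_cost(runs)
    print(f"{runs} runs: ${cost:.2f} ({cost / 19:.1f}x Copilot Business)")

print(f"Breakeven vs Business: {breakeven_runs(19):.1f} runs")
print(f"Breakeven vs Enterprise: {breakeven_runs(39):.1f} runs")
```

The 26-run and 38-run rows, by contrast, follow directly from the per-scenario costs and frequencies.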

Observations and Notes

Strengths Demonstrated

Autonomous multi-step execution: Completed all 5 scenarios in a single continuous session without requiring user intervention between scenarios
Multi-tool orchestration: Used mock JIRA, Elastic, and GitLab tools appropriately across scenarios
Root cause elevation: Identified NTK-10004 as an architectural boundary violation, not just a code bug (the key insight the playbook tests for)
Standards compliance: All ADRs formatted to MADR template; PlantUML syntax is valid; solution designs follow arc42 structure
Scope discipline: NTK-10001 changes were limited to what the solution design specified (no scope creep)

Limitations Observed

No per-request token visibility: Cannot produce exact token costs — estimates only
Single-session execution: All 5 scenarios ran in one continuous conversation, which may inflate context usage compared to isolated runs
Pre-existing artifacts: Several scenario artifacts already existed in the workspace (by design); the AI correctly identified and enhanced them rather than creating duplicates
Context window management: As the session progressed across 5 scenarios, earlier context was summarized — later scenarios had less access to early scenario details

Quality Notes

SC-01 deductions (-2): Workspace Scaffolding scored 4/5 because a flat file structure was created in some areas instead of strictly following the folder convention; Recommendation Quality also scored 4/5
SC-02 deductions (-2): Impact Assessment and User Stories each scored 4/5; per the scoring notes, existing artifacts were verified in both cases
SC-04 deduction (-1): PlantUML diagram update used a note annotation rather than a full structural change; functional but could be more integrated
SC-05 deduction (-1): C4 component diagram was created as a new file rather than as an update to the existing system context diagram; a valid approach, but the system-level diagram could also have been updated