The FORGE Methodology
A governance-first framework for AI agent systems
Framework for Orchestrated Reasoning, Governance & Execution
A structured approach to building reliable, accountable AI agent systems
What Is FORGE?
FORGE is a governance-first framework for building and operating AI agent systems with verifiable quality and institutional accountability.
Most AI systems are single-pass: give the model a task, get an output, ship it. FORGE is deliberately multi-pass and multi-perspective. Different agents handle different phases. Every agent runs a quality cycle internally. QA and Security review in parallel before anything ships.
The result: autonomous AI work with observable, verifiable quality—not just "the model said it looked good."
FORGE was designed for Bamwerks, a 40-agent AI organization, and builds on two influences: the structured phase discipline of the AWS AI-DLC (AI Development Lifecycle) methodology, and Loki Mode — a fully autonomous multi-agent development system that transforms a PRD into built, tested code using 41 specialized agent types across 8 swarms. Loki Mode introduced the RARV cycle (Reason → Act → Reflect → Verify) that sits at the core of FORGE. FORGE applies to any autonomous AI system where quality and trust matter.
The Problem: Why AI Agent Deployments Fail
The global AI agent market is projected to reach $52.62B by 2030 (46.3% CAGR), operating within a broader AI landscape projected to exceed $3.5 trillion by 2033. Gartner predicts 40% of enterprise applications will feature AI agents by end of 2026, up from less than 5% in 2025.
But there's a crisis brewing.
The 40% Failure Rate
Gartner predicts 40% of agentic AI projects will be scrapped by 2027—not because of technical limitations, but because of operationalization failures:
- Pilot-ware with no path to production: Demos impress, but lack identity management, audit trails, and compliance controls
- Data and integration friction: Fragmented systems, brittle APIs, no clear data ownership
- Risk and governance concerns: CISOs block deployment due to prompt injection, over-permissioning, and lack of traceability
- Reliability in long-running workflows: Even 1% error rates compound across 10-step processes
- ROI ambiguity: Pilots designed to impress, not measure business outcomes
The Governance Gap
Only 9% of enterprises operate with mature AI governance frameworks, yet 73% seek explainable, accountable AI systems.
Industry frameworks (LangGraph, CrewAI, AutoGen) focus on orchestration mechanics—they tell you how to chain agents together, but not how to ensure the work is correct, secure, or auditable. Governance is treated as an optional add-on, typically bolted on through separate observability tools like LangSmith or Galileo.
Security Threats Are Real
In a February 2026 poll of cybersecurity professionals, 48% ranked agentic AI as the #1 attack vector for 2026, ahead of deepfakes and ransomware.
The OWASP Top 10 for Agentic Applications (released December 2025) identifies critical risks:
| OWASP Risk | Description |
|---|---|
| ASI01 – Goal Hijack | Malicious prompt injection redirects agent objectives |
| ASI02 – Tool Misuse | Agents use APIs, databases in unintended/harmful ways |
| ASI03 – Credential Exposure | Agents leak or misuse authentication tokens |
| ASI04 – Memory Poisoning | Compromised long-term memory corrupts future behavior |
| ASI05 – Supply Chain Vulnerabilities | Malicious dependencies inject backdoors |
| ASI06 – Unintended Actions | Agents execute high-impact operations without approval |
| ASI07 – Excessive Agency | Over-permissioned agents exceed intended scope |
| ASI08 – Data Exfiltration | Agents leak sensitive data to external systems |
| ASI09 – Lack of Observability | Insufficient logging enables silent failures |
| ASI10 – Governance Sprawl | Unmanaged agent proliferation ("shadow AI") |
FORGE directly addresses these risks. It's not a security tool—it's a governance framework that makes security verifiable by design.
Influences and Origins
FORGE draws on two primary influences:
AWS AI-DLC (AI Development Lifecycle): AWS's structured methodology for AI system development provided the phase-gate architecture that underlies the FORGE Workflow — the idea that work moves through defined stages with explicit handoffs, rather than continuous improvised iteration.
Loki Mode: A fully autonomous, provider-agnostic multi-agent development system. Loki Mode orchestrates 41 specialized agent types across 8 swarms (engineering, operations, business, data, product, growth, review, and orchestration) to take a Product Requirements Document and produce a built, tested, deployment-ready product — without human prompting between steps.
Loki Mode's core contribution to FORGE is the RARV cycle: Reason (read state, identify next task) → Act (execute, commit) → Reflect (update continuity, learn) → Verify (run tests, check spec). In Loki Mode, if verification fails, the system captures the failure as a learning and retries from Reason. FORGE adopted this self-correcting loop as the discipline every agent runs internally on every task — not just for code generation, but for any autonomous work.
Loki Mode also informed FORGE's approach to quality gates (blind review, anti-sycophancy controls, severity-based blocking) and the principle that verification must be automated, not assumed.
Together, these influences produce a methodology that is simultaneously structured (from AI-DLC) and self-correcting (from Loki Mode) — which turns out to be exactly what production multi-agent systems need.
One Framework, Two Layers
FORGE operates at two complementary levels that compose naturally:
| Layer | Scope | Question It Answers |
|---|---|---|
| FORGE Workflow | Project lifecycle | When do agents run? Which agents? In what order? |
| FORGE Cycle | Agent-level discipline | How does each agent think and verify within their phase? |
The Workflow determines which agents run and when. The Cycle is how every agent — including Sir, the orchestrator — works through any task.
When Sir receives a request, he runs the Cycle: Reason (what exactly is being asked, is it a task or a conversation?), Act (dispatch to the right specialist), Reflect (did the agent produce what was needed?), Verify (both gates passed?). When Ratchet builds a feature, he runs the Cycle: Reason (understand the spec), Act (implement it), Reflect (does this actually work?), Verify (TypeScript clean, tests pass). When Hawk reviews output, he runs the Cycle: Reason (what are the acceptance criteria?), Act (test against them), Reflect (what did I miss on first pass?), Verify (is my confidence high enough to approve?).
The Cycle is not a checklist. It's the discipline that separates agents that verify their own work from agents that just produce output and stop.
The FORGE Cycle
Reason → Act → Reflect → Verify
Every agent runs this cycle internally before delivering any work. This is not a suggestion—it's the foundational discipline that makes agent output trustworthy.
Stage 1: Reason
Understand the task before touching anything.
The orchestrator receives a request and builds a complete picture:
- What exactly is being asked? What does success look like?
- Which specialists need to be involved?
- Are there constraints, dependencies, or conflicts to resolve?
- What context do the executing agents need?
Clarifying questions get asked here—not during Act, when rework is expensive.
Output: Structured Task Brief
Every task brief has four sections:
| Section | Contents |
|---|---|
| GOAL | Measurable success criteria |
| CONSTRAINTS | Hard limits—what cannot be done, what tools/patterns to use |
| CONTEXT | Files to read, prior decisions, related work |
| OUTPUT | Exact deliverables, in checklist format |
Example Brief:
```markdown
## GOAL
Add a public-facing documentation page explaining FORGE methodology

## CONSTRAINTS
- Static site (Next.js with `output: 'export'`)
- Match existing page patterns (charter.md style)
- No server-side runtime
- Mobile-responsive, accessibility compliant

## CONTEXT
Read:
- /content/charter.md (style reference)
- /agents/workflows/aidlc-bamwerks.md (FORGE definition)
- /memory/research/ai-agent-landscape-2026-deep.md (market context)

## OUTPUT
- [ ] New file: /content/forge-methodology.md (800-1200 lines)
- [ ] Includes mermaid diagrams
- [ ] Professional tone, practical examples
- [ ] Structured with clear sections and cross-links
```
Key principle: Scope matters. A task to update how agent avatars display means every page showing avatars, not just one component. Broad scope, specific brief.
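The four-section brief can be modeled as a simple data structure. This is an illustrative sketch, not a FORGE API — the `TaskBrief` interface and `isDispatchable` helper are hypothetical names invented for this example:

```typescript
// Hypothetical shape of a structured task brief; the field names mirror
// the four sections above but are not part of any official FORGE API.
interface TaskBrief {
  goal: string;           // measurable success criteria
  constraints: string[];  // hard limits and required patterns
  context: string[];      // files to read, prior decisions, related work
  output: string[];       // exact deliverables, as checklist items
}

// A brief is only ready to dispatch when every section is populated —
// an empty CONSTRAINTS or CONTEXT section is a signal to keep Reasoning.
function isDispatchable(brief: TaskBrief): boolean {
  return (
    brief.goal.trim().length > 0 &&
    brief.constraints.length > 0 &&
    brief.context.length > 0 &&
    brief.output.length > 0
  );
}
```

Treating the brief as structured data (rather than free text) is what lets the orchestrator enforce "clarifying questions get asked in Reason": an incomplete brief simply cannot be dispatched.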
Stage 2: Act
Specialists execute against the brief.
Once the brief is written, relevant specialist agents are dispatched. Tasks are scoped to features, not files. Multiple agents can work in parallel on independent aspects of the same deliverable.
Agent Context Boundaries
Agents receive role-specific context—no agent gets more information than its task requires:
- An engineering agent gets the codebase, build tools, design docs
- A security agent gets threat models, vulnerability patterns, API surface
- A QA agent gets test strategies, acceptance criteria, regression patterns
This isn't just efficiency—it's security. Agents don't "see" data outside their scope.
Parallel Dispatch
When tasks are independent, agents work simultaneously:
- Builder A: Implements frontend component
- Builder B: Writes backend API (different repository)
- QA: Prepares test strategy in parallel
This compresses elapsed time without sacrificing depth.
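The dispatch pattern above can be sketched with ordinary concurrent promises. The `dispatch` function and agent names here are placeholders, not a real FORGE interface:

```typescript
// Illustrative sketch of parallel dispatch; dispatch() is a stand-in for
// whatever mechanism actually invokes an agent.
type AgentResult = { agent: string; ok: boolean };

async function dispatch(agent: string, task: string): Promise<AgentResult> {
  // A real system would run the agent here; this stub simulates success.
  return { agent, ok: true };
}

// Independent tasks run concurrently, so elapsed time is bounded by the
// slowest task rather than the sum of all tasks.
async function dispatchParallel(
  tasks: Array<[agent: string, task: string]>
): Promise<AgentResult[]> {
  return Promise.all(tasks.map(([agent, task]) => dispatch(agent, task)));
}
```

Usage mirrors the example above: `dispatchParallel([["builderA", "frontend component"], ["builderB", "backend API"], ["qa", "test strategy"]])`.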
Stage 3: Reflect
Independent review from multiple perspectives.
The Act output goes through review before Verify. Three properties make this review meaningful:
1. Multi-Perspective Review
QA and Security review simultaneously, not sequentially. Each reviewer focuses on its specialty without seeing the other's findings first—this prevents anchoring and groupthink.
- QA Agent (Hawk) checks: visual consistency, broken links, mobile layout, accessibility, spec compliance, test coverage
- Security Agent (Sentinel) checks: exposed internals, authentication bypass, data leakage, supply chain risks, privilege boundaries
Both reviews happen in parallel. Neither reviewer knows the other's conclusions until both are complete.
2. Anti-Sycophancy Protocol
If all reviewers agree that output is perfect, a contrarian review is triggered.
Unanimous praise is a signal, not a conclusion. At least one reviewer is asked:
"The other reviewers found no issues. You are the contrarian. What did they miss? What edge cases weren't considered? What assumptions are we making that could be wrong?"
This protocol directly addresses OWASP ASI09 (Lack of Observability) and prevents the groupthink that plagues single-agent or single-review systems.
3. Critical Findings Block Delivery
A critical finding from any single reviewer blocks delivery—majority opinion doesn't override it.
One failure = the task doesn't ship.
This is deliberate. Security vulnerabilities, data integrity issues, and accessibility failures don't require consensus to be real problems.
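Both rules — unanimous clean reviews trigger a contrarian pass, and any single critical finding blocks delivery — reduce to a few lines of synthesis logic. This is a minimal sketch with invented names, not production review code:

```typescript
// Hypothetical review-synthesis rules from this section.
interface Review {
  reviewer: string;
  findings: string[];
  critical: boolean; // did this reviewer raise a critical finding?
}

// Anti-sycophancy: if every reviewer came back clean, ask a contrarian.
function needsContrarian(reviews: Review[]): boolean {
  return reviews.every(r => r.findings.length === 0);
}

// Blocking: one critical finding blocks delivery; it is not a vote.
function isBlocked(reviews: Review[]): boolean {
  return reviews.some(r => r.critical);
}
```

Note the asymmetry: `needsContrarian` uses `every` (unanimity is the trigger) while `isBlocked` uses `some` (a single reviewer suffices).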
Stage 4: Verify
Confirm the deliverable actually works.
Code review is not verification. Reading code and understanding it is not the same as running it.
Verify means:
Runtime Testing
- Build passes: The project compiles and builds cleanly (no TypeScript errors, no missing dependencies)
- Feature works: The functionality operates correctly in a live environment, not just on paper
- Spec check: The output matches what was asked for in Reason
- Edge cases handled: Boundary conditions, error states, graceful degradation
Both Gates Pass
QA gate: Hawk confirms tests pass, spec is met, no regressions introduced
Security gate: Sentinel confirms no new vulnerabilities, no exposed internals, secrets management correct
Failure Handling
If Verify fails:
- Capture the error: What failed? What was expected vs. actual behavior?
- Understand root cause: Was it a misunderstanding in Reason? A logic error in Act? An edge case missed in Reflect?
- Loop back with corrected approach: Don't just patch the symptom—fix the underlying issue
Repeated failures trigger escalation to the orchestrator or human oversight.
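The full cycle, including the loop-back on failed verification and escalation after repeated failures, can be sketched as a retry loop. The stage functions and `maxAttempts` threshold are placeholders for illustration:

```typescript
// Minimal sketch of Reason → Act → Reflect → Verify with failure handling.
interface CycleResult {
  passed: boolean;
  attempts: number;
}

function runCycle(
  reason: () => string,               // build the task brief
  act: (brief: string) => string,     // produce the deliverable
  reflect: (output: string) => void,  // capture learnings, update continuity
  verify: (output: string) => boolean, // run tests, check against spec
  maxAttempts = 3
): CycleResult {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const brief = reason();
    const output = act(brief);
    reflect(output);
    if (verify(output)) return { passed: true, attempts: attempt };
    // Verification failed: the failure is captured via reflect(), and the
    // loop returns to Reason with a corrected approach — not a symptom patch.
  }
  // Repeated failures: escalate to the orchestrator or human oversight.
  return { passed: false, attempts: maxAttempts };
}
```

The important structural point is that failure re-enters at Reason, not Act — the retry re-examines the understanding of the task, not just the implementation.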
The FORGE Workflow
Task Sizing → Inception → Construction → Gate → Ship
FORGE Workflow structures work into phases based on task complexity. Every phase is staffed by agents running the FORGE Cycle internally.
Phase 1: Task Sizing
The orchestrator evaluates complexity and assigns a scope level.
This determines which agents run and how much design work happens upfront. A miscategorized task wastes time—over-engineering a config fix or under-planning a new system both cause rework.
Sizing Matrix
| Size | Examples | Design Depth | Builders | Review |
|---|---|---|---|---|
| Small | Fix typo, config change, nav update | Skip | 1 builder, direct task | Quick QA pass |
| Medium | New page, new feature, integration | Application design | 1 builder with plan | QA + Security |
| Large | New system, multi-component | Architecture + unit decomposition | Parallel builders | Structured test strategy + Security NFR |
Key insight: The same task can be Small in one context and Medium in another. A "new page" for a static site might be Small (copy existing pattern), but a "new page" for a complex web app with auth, database, and API integration is Medium or Large.
Context drives sizing, not just surface characteristics.
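Sizing is ultimately a judgment call, but the shape of that judgment can be sketched. The signals and thresholds below are invented for illustration — real sizing weighs far more context than three booleans:

```typescript
// Illustrative sizing heuristic; the signals are hypothetical examples
// of the context that drives the decision.
type Size = "small" | "medium" | "large";

interface TaskSignals {
  touchesMultipleComponents: boolean; // new system vs. isolated change
  needsNewDesign: boolean;            // no existing pattern to copy
  securitySensitive: boolean;         // auth, data, credentials involved
}

function sizeTask(s: TaskSignals): Size {
  if (s.touchesMultipleComponents && s.needsNewDesign) return "large";
  if (s.needsNewDesign || s.securitySensitive) return "medium";
  return "small";
}
```

This captures the "new page" example above: copying an existing static-site pattern sizes small, while the same request with auth and database integration flips `needsNewDesign` and `securitySensitive` and sizes medium or large.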
Phase 2: Inception
For medium and large tasks, an Architect agent designs before any code is written.
Medium Task Inception
Architect produces:
- Application architecture: System boundaries, data flows, integration points
- Component breakdown: What gets built, dependencies between parts
- Risk assessment: Where complexity lives, what could fail
- Design constraints: Patterns to follow, anti-patterns to avoid
Large Task Inception (Full)
Adds:
- Formal requirements gathering: Stakeholder alignment, success criteria, non-functional requirements
- Unit decomposition: Breaking the system into independent work units with explicit contracts
- Test strategy: What gets tested, how, by whom, and when
- Deployment plan: Rollout strategy, rollback procedures, monitoring
Output artifacts:
forge-docs/
├── inception/
│ ├── requirements/ # Requirements docs
│ ├── reverse-engineering/ # Codebase analysis (brownfield projects)
│ └── application-design/ # Architect's component design
Phase 3: Construction
Builders receive the design and implement.
Construction Flow by Size
| Size | Construction Process |
|---|---|
| Small | Direct execution—builder reads brief, implements, self-verifies |
| Medium | Builder follows Architect's plan, implements with spec adherence checks |
| Large | Work decomposed into parallel units with explicit contracts between components |
Builder Responsibilities
Every builder runs the FORGE Cycle internally:
- Reason about the design: What are the requirements? What patterns should I follow?
- Act by writing code: Implement the feature according to the plan
- Reflect on their own output: Does this match the spec? Are there edge cases I missed?
- Verify it builds and runs: Tests pass, no regressions, functionality works
Builders don't "freestyle." If the design is unclear or incomplete, they escalate to the Architect—they don't improvise beyond scope.
Construction Artifacts
forge-docs/
├── construction/
│ ├── {unit-name}/
│ │ ├── functional-design/ # Architect's per-unit design (large tasks)
│ │ ├── nfr-requirements/ # Security's pre-build requirements
│ │ └── code/ # Code generation plan + summary
│ └── build-and-test/ # QA's test strategy
Phase 4: Gate
QA and Security review in parallel. Both must pass before anything ships.
This is the most critical phase—where theory meets reality.
QA Gate (Hawk)
What QA checks:
- ✅ Visual consistency with existing patterns
- ✅ All links resolve correctly (no 404s)
- ✅ Mobile layout works (responsive breakpoints)
- ✅ Accessibility compliance (ARIA labels, keyboard navigation, color contrast)
- ✅ Spec compliance (output matches the brief)
- ✅ Test coverage (unit tests, integration tests where applicable)
- ✅ No regressions introduced
QA runs the FORGE Cycle:
- Reason: What should I test? What are the acceptance criteria?
- Act: Execute test plan, check all assertions
- Reflect: Are there edge cases I missed? What could break that I didn't test?
- Verify: All checks pass, documentation updated
Security Gate (Sentinel)
What Security checks:
- 🔒 No exposed internals (API keys, credentials, internal URLs)
- 🔒 Authentication and authorization correct
- 🔒 Input validation and output encoding (prevent injection)
- 🔒 Data leakage prevented (no PII in logs, no debug output in production)
- 🔒 Supply chain risks assessed (dependencies vetted, no malicious packages)
- 🔒 Privilege boundaries enforced (least-privilege access)
Security runs the FORGE Cycle:
- Reason: What could leak? What could be exploited?
- Act: Scan code, check dependencies, review API surface
- Reflect: Could this be weaponized? What attack vectors exist?
- Verify: No security issues found, threat model validated
Gate Decision Logic
| QA Result | Security Result | Outcome |
|---|---|---|
| PASS | PASS | ✅ Ship |
| PASS | FAIL | ❌ Blocked (security critical) |
| FAIL | PASS | ❌ Blocked (quality critical) |
| FAIL | FAIL | ❌ Blocked (both critical) |
One failure is enough to block delivery. This is not consensus-based—both gates are requirements, not votes.
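The decision table above is a plain logical AND — both gates are requirements, not votes. As a sketch (the function name and result strings are illustrative):

```typescript
// The gate decision table, expressed as code: ship only if both pass.
type GateResult = "PASS" | "FAIL";

function gateDecision(qa: GateResult, security: GateResult): "SHIP" | "BLOCKED" {
  return qa === "PASS" && security === "PASS" ? "SHIP" : "BLOCKED";
}
```

There is deliberately no branch that weighs one gate against the other; a QA pass cannot outvote a Security fail, or vice versa.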
Phase 5: Ship
Both gates passed. The orchestrator merges and deploys.
Shipping is not just "push to main":
- Final build verification: Ensure production build succeeds
- Changelog update: Document what changed and why
- Deployment: Push to production (or staging for further testing)
- Monitoring: Watch for errors, performance issues, user feedback
- Retrospective (if applicable): For complex tasks, document lessons learned
Shipping is the end of the Workflow, but not the end of accountability. Post-deployment issues trigger retrospectives.
Agent Roles in FORGE
FORGE defines five core roles, each with specific responsibilities and phases where they operate.
1. Orchestrator
Who: The main coordination agent (in Bamwerks: Sir)
Responsibilities:
- Runs the entire FORGE Workflow
- Performs task sizing
- Creates structured task briefs (GOAL/CONSTRAINTS/CONTEXT/OUTPUT)
- Dispatches specialists
- Synthesizes multi-agent review results
- Makes final ship/no-ship decision
- Writes retrospectives on failures
What the Orchestrator does NOT do:
- ❌ Implement code
- ❌ Write designs
- ❌ Perform QA
- ❌ Conduct security audits
The Orchestrator orchestrates. Never implements. This is a hard rule.
Why this matters: If the Orchestrator also implements, it can't objectively review its own work. Single-agent systems fail because the same reasoning that creates a solution also reviews it—blind spots are systematic, not random.
2. Architect
Who: Design specialist (in Bamwerks: Ada)
Responsibilities:
- Reverse-engineering (brownfield projects with no docs)
- Application architecture (system boundaries, data flows, integration points)
- Component breakdown (what gets built, dependencies)
- Risk assessment (complexity, failure modes)
- Unit decomposition (large tasks → independent work units)
- Functional design (per-unit logic for complex features)
Phases: Inception (required for medium+ tasks) + Construction (per-unit design for large tasks)
Architect does NOT:
- ❌ Write production code (designs only)
- ❌ Perform QA or security reviews
Why this matters: Separation of design from implementation prevents "I know what I meant" bias. Builders work from explicit specs, not assumptions.
3. Builders
Who: Implementation specialists with domain expertise
Examples:
- Frontend Builder: React, Next.js, Tailwind, accessibility
- Backend Builder: APIs, databases, authentication, business logic
- DevOps Builder: Infrastructure, CI/CD, monitoring, deployment
Responsibilities:
- Implement features according to Architect's design
- Follow design constraints and patterns
- Write tests alongside code
- Self-verify build and functionality
- Escalate when design is unclear (don't improvise beyond scope)
Phases: Construction
Builders do NOT:
- ❌ Design architecture (follow the Architect's plan)
- ❌ Review their own code as QA/Security
- ❌ Skip testing ("I'll add tests later")
Why multiple builders? Different expertise. A frontend specialist knows accessibility patterns and responsive design. A backend specialist knows database transactions and API design. Specialization improves quality.
4. QA Agent
Who: Quality verification specialist (in Bamwerks: Hawk)
Responsibilities:
- Test strategy creation (large tasks)
- Build verification (all tasks)
- Spec compliance checking
- Regression testing
- Accessibility verification
- Code review (readability, maintainability, test coverage)
- Anti-sycophancy contrarian review (when needed)
Phases: Gate (required for all tasks except trivial)
QA does NOT:
- ❌ Implement features (reviews only)
- ❌ Approve security issues (that's Security's gate)
Why independent QA? Builders self-verify before submission, but they have blind spots. QA brings fresh eyes, checks against the original brief, and catches what the builder missed.
5. Security Agent
Who: Security verification specialist (in Bamwerks: Sentinel)
Responsibilities:
- Non-functional requirements (NFR) definition (large tasks, pre-build)
- Security review (post-build, all medium+ tasks)
- Credential exposure checks
- Data leakage prevention
- Supply chain risk assessment
- Privilege boundary enforcement
- Threat modeling
Phases: Inception (NFR definition for large tasks) + Gate (security review for medium+ tasks)
Security does NOT:
- ❌ Implement features
- ❌ Approve quality issues (that's QA's gate)
Why independent Security? Security thinks like an attacker. QA thinks like a user. These are different perspectives—both are critical.
Role Mapping to FORGE Phases
| Phase | Orchestrator | Architect | Builders | QA | Security |
|---|---|---|---|---|---|
| Task Sizing | ✅ Runs | — | — | — | — |
| Inception | ✅ Coordinates | ✅ Designs | — | — | ✅ NFR (large tasks) |
| Construction | ✅ Dispatches | ✅ Per-unit design (large) | ✅ Implements | — | — |
| Gate | ✅ Synthesizes | — | — | ✅ Verifies quality | ✅ Verifies security |
| Ship | ✅ Merges | — | — | — | — |
The Charter: Why Behavioral Contracts Matter
FORGE is the operational framework. The Charter is the governance foundation.
What Is a Charter?
A Charter is an immutable behavioral contract that defines:
- Mission: Why the agent system exists
- Principles: Core values that guide decision-making
- Roles: Who does what (and who doesn't)
- Boundaries: What agents can and cannot do without approval
- Accountability: How failures are handled and who owns them
Think of it as a constitution. Laws (workflows) change. The constitution (charter) endures.
Charter vs. Prompt
| Aspect | Prompt | Charter |
|---|---|---|
| Scope | Single task | System-wide governance |
| Mutability | Changes per task | Immutable (or founder-only edit) |
| Enforcement | Implicit | Explicit, read every session |
| Accountability | None | Named roles, retrospective requirement |
A prompt tells an agent what to do. A charter tells the system how to be.
Core Charter Elements
Every AI agent system implementing FORGE should have a Charter with these sections:
1. Mission Statement
Purpose: Why does this system exist? Who does it serve?
Example:
"This AI organization exists for three purposes:
- Success — Advance the founder's professional and personal goals
- Protection — Guard security, privacy, data, and reputation
- Enlightenment — Surface insights, opportunities, and knowledge"
2. Governance Principles
Purpose: Core values that guide all agent behavior
Example principles:
- Multiple perspectives prevent blind spots → Multi-agent review in FORGE Reflect
- Verification builds trust → Runtime testing in FORGE Verify, not assertions
- Constraints enable speed → Strict gates reduce rework
- Memory over reasoning → Write decisions down, don't rely on "mental notes"
- Task sizing drives depth → Match effort to complexity
3. Role Definitions
Purpose: Explicit mapping of responsibilities
Example:
- Orchestrator: Dispatches tasks, synthesizes results, writes retrospectives. NEVER implements.
- Architect: Designs systems. NEVER implements production code.
- Builders: Implement according to design. NEVER improvise beyond scope.
- Reviewers: QA and Security verify independently. NEVER approve their own work.
4. Behavioral Boundaries
Purpose: Hard limits on agent actions without human approval
Example:
- External communications (email, social media posts) require approval
- Financial transactions above $X require approval
- Deletion of data requires confirmation
- Changes to the Charter require founder approval
- Credential access is logged and auditable
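Boundaries like these are most useful when they are machine-checkable. A minimal sketch, with a hypothetical action registry and three tiers matching the examples above (allowed / approval-required / prohibited):

```typescript
// Hypothetical boundary policy; action names and tiers are illustrative.
type Tier = "allowed" | "approval" | "prohibited";

const boundaries: Record<string, Tier> = {
  "read-database": "allowed",
  "write-logs": "allowed",
  "send-external-email": "approval",
  "delete-data": "approval",
  "edit-charter": "approval",        // founder approval only
  "expose-credentials": "prohibited",
};

function checkAction(action: string): Tier {
  // Unknown actions default to requiring approval, never to allowed.
  return boundaries[action] ?? "approval";
}
```

The fail-closed default is the design point: an action the Charter never anticipated escalates to a human rather than silently proceeding.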
5. Accountability Protocol
Purpose: How failures are handled
Example:
"When something goes wrong:
- The orchestrator owns it (not the implementing agent)
- A retrospective is written within 24 hours
- Retrospective includes: What happened → Root cause → Who's accountable → Prevention
- Retrospectives are filed in memory/ directory for institutional learning
- Repeated failures of the same type trigger escalation to human oversight"
How to Write Your Charter
Step 1: Define Your Mission
Why are you building this agent system? Who does it serve? What outcomes matter?
Be specific. "Automate tasks" is not a mission. "Reduce manual DevOps toil by 80% while maintaining zero security incidents" is.
Step 2: Choose Your Principles
What values guide decision-making when rules don't cover a situation?
Examples:
- "Security over speed" (when in doubt, add review)
- "Transparency over efficiency" (log all decisions)
- "Human oversight for high-impact actions" (define "high-impact")
Step 3: Map Roles to FORGE Phases
Who runs Inception? Who implements? Who reviews?
Write it explicitly. "The senior engineer" is not a role—"Architect agent with 10+ years system design experience" is.
Step 4: Define Boundaries
What actions require approval? What actions are prohibited entirely?
Examples:
- ✅ Can: Read database, write logs, send internal notifications
- ⚠️ Requires approval: Send emails to external recipients, modify production config
- ❌ Prohibited: Delete databases, expose credentials, bypass security gates
Step 5: Create Accountability Mechanisms
How do you learn from failures without blame?
Write a retrospective protocol:
- Who writes them? (Orchestrator, not the failing agent)
- When? (Within 24 hours of incident)
- What's included? (What happened, root cause, accountability, prevention)
- Where are they stored? (Persistent memory directory)
- Who reviews them? (Human oversight for systemic issues)
Step 6: Make It Immutable (or Founder-Only)
The Charter is not a living document that anyone can edit. It's the foundation.
Options:
- Immutable: Charter never changes (requires full system rebuild to modify)
- Founder-only: Only the human owner can edit, agents read-only
- Governance board: Multi-signature approval required (for organizations)
Bamwerks approach: Charter is founder-only (read-only for all agents, write access only for the Founder). This prevents agent self-modification while allowing evolution as the organization learns.
Charter Examples by Use Case
Personal AI Assistant
```markdown
# Charter: Personal AI Assistant

## Mission
Serve one human (the Founder) with three priorities:
1. Productivity — Complete tasks efficiently and accurately
2. Privacy — Never leak personal data
3. Proactivity — Anticipate needs, don't just react

## Principles
- Ask before sending external messages
- Confirm before deleting data
- Write decisions down (no "mental notes")
- Fail loudly, fix quickly

## Roles
- Orchestrator: Main agent (coordinates all work)
- Specialists: Domain-specific agents (research, coding, writing)

## Boundaries
- ✅ Can: Read files, search web, draft messages
- ⚠️ Approval: Send emails, post to social media, modify system files
- ❌ Never: Share credentials, bypass encryption, ignore Founder directives

## Accountability
- Every external action logged
- Failures trigger retrospective within 24 hours
- Retrospective includes: What happened, why, prevention
```
Enterprise Development Team
```markdown
# Charter: Enterprise AI Development Team

## Mission
Accelerate software delivery for [Company] with three goals:
1. Velocity — Ship features 50% faster
2. Quality — Zero critical bugs in production
3. Security — Pass all security audits

## Principles
- Security over speed (when in doubt, add review)
- Design before implementation (no "cowboy coding")
- Test coverage required (not optional)
- Human approval for production deployments

## Roles
- Orchestrator: Tech Lead AI (task sizing, coordination)
- Architect: Senior Engineer AI (design, architecture)
- Builders: Domain-specific engineers (frontend, backend, DevOps)
- QA: Test Engineer AI (verification, regression)
- Security: AppSec AI (security review, threat modeling)

## Boundaries
- ✅ Can: Read repos, run tests, draft PRs
- ⚠️ Approval: Merge to main, deploy to production, modify CI/CD
- ❌ Never: Skip security review, deploy without tests, expose credentials

## Accountability
- Code owners review all PRs
- Security audit on every release
- Post-mortems for all P0 incidents within 48 hours
- Quarterly security penetration tests
```
Anti-Patterns: What NOT to Do
FORGE is effective because it enforces discipline. Skipping steps or "optimizing" the process usually introduces the failures FORGE was designed to prevent.
❌ Anti-Pattern 1: Skip Reviews for "Quick Fixes"
The temptation:
"This is just a one-line config change. I don't need QA/Security review for this."
Why it fails:
"Quick fixes" compound. A one-line change that skips review becomes ten one-line changes. Eventually one of them breaks production, and there's no review trail to understand what happened.
Real consequences:
- Config change breaks authentication → security incident
- One-line CSS fix breaks mobile layout → UX regression
- "Trivial" dependency update introduces vulnerability → supply chain attack
FORGE approach:
Even small tasks get some review. The depth scales with risk:
- Typo fix: Self-review + quick Orchestrator check
- Config change: QA quick pass
- Security-sensitive config: Full Security gate
The overhead of review is less than the cost of incidents.
❌ Anti-Pattern 2: Self-Review Only
The temptation:
"I built it, I tested it, it works. Why do I need someone else to check?"
Why it fails:
The same reasoning that creates a solution also reviews it. Blind spots are systematic, not random.
Example:
- Builder tests "happy path" → QA finds edge cases (empty input, network failure, race conditions)
- Builder checks functionality → Security finds privilege escalation
- Builder verifies desktop → QA finds mobile layout breaks
FORGE approach:
Independent review is mandatory. The builder self-verifies before submission (Verify stage of their Cycle), but QA and Security review independently (Reflect stage at workflow level).
❌ Anti-Pattern 3: Unanimous Agreement Without Challenge
The temptation:
"All three reviewers said it's perfect. Ship it!"
Why it fails:
Unanimous praise without contrarian challenge often means:
- Everyone made the same assumption
- Obvious issues got normalized ("that's just how we do it")
- Reviewers anchored on each other's opinions
Real example:
Three agents review a new authentication flow. All say "looks good." No one catches that the session token is logged in plaintext—because none of them were explicitly asked to check logging output.
FORGE approach:
Anti-sycophancy protocol: If all reviewers agree without finding issues, a contrarian review is triggered.
"The other reviewers found no issues. You are the contrarian. What did they miss?"
This protocol forces at least one reviewer to think adversarially, breaking the groupthink.
❌ Anti-Pattern 4: Orchestrator Implements
The temptation:
"I'm the orchestrator and I know how to code. Why dispatch another agent when I can just do it myself?"
Why it fails:
If the Orchestrator implements, it can't objectively synthesize review. When Hawk says "this code is hard to read" and the Orchestrator replies "but I wrote it and I understand it"—that's not synthesis, that's defensiveness.
The orchestrator's job is coordination, not execution.
FORGE approach:
Hard rule: Orchestrators orchestrate, never implement.
- Task sizing → Orchestrator
- Design → Architect
- Implementation → Builders
- Review → QA + Security
- Synthesis → Orchestrator (coordinates, doesn't override)
Separation of concerns prevents conflicts of interest.
❌ Anti-Pattern 5: Workflow as Waterfall
The temptation:
"We must complete every Inception artifact before Construction can start."
Why it fails:
Rigid phase-gates stall work that doesn't need them. FORGE is not Waterfall—it's adaptive workflow structure, not sequential sign-offs.
Small tasks skip Inception entirely. Medium tasks get lightweight design. Large tasks get full architecture—but even then, unit decomposition allows parallel work.
FORGE approach:
Depth adapts to complexity:
- Small → Direct dispatch (no Inception)
- Medium → Lightweight design (application architecture, no full requirements doc)
- Large → Full Inception (architecture + units + test strategy)
Phases overlap intentionally: Architect can design Unit B while Builder implements Unit A.
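The size-to-depth mapping above can be expressed directly. A sketch, assuming a crude size heuristic by estimated files touched (the thresholds and phase names are illustrative):

```python
# Sketch: adaptive depth from task size, per the tiers above.
# The files-touched heuristic and thresholds are illustrative assumptions.

def size_task(files_touched: int) -> str:
    if files_touched <= 1:
        return "small"
    if files_touched <= 5:
        return "medium"
    return "large"

def phases_for(size: str) -> list[str]:
    return {
        "small": ["construction", "gate", "ship"],                   # direct dispatch
        "medium": ["light_design", "construction", "gate", "ship"],  # lightweight design
        "large": ["inception", "construction", "gate", "ship"],      # full Inception
    }[size]
```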
❌ Anti-Pattern 6: Security as an Afterthought
The temptation:
"We'll do a security review after launch."
Why it fails:
Security vulnerabilities found after deployment are far more expensive to fix than those caught in review, and they may already have been exploited.
FORGE approach:
Security is built into the Workflow:
- Inception: Security defines non-functional requirements (NFRs) for large tasks
- Gate: Security review is mandatory for medium+ tasks (parallel with QA)
- Pre-merge: Both gates must pass before deployment
Security is not a separate audit—it's a parallel track throughout the lifecycle.
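The pre-merge rule reduces to a single conjunction: neither gate can waive the other. A minimal sketch, with the gate-result shape as an assumption:

```python
# Sketch: the pre-merge rule above -- both parallel gates must pass.
# The gate-result dict shape is an assumption.

def can_ship(qa_gate: dict, security_gate: dict) -> bool:
    """Deployment requires both QA and Security approval, never just one."""
    return qa_gate["passed"] and security_gate["passed"]
```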
❌ Anti-Pattern 7: No Retrospectives on Failures
The temptation:
"The bug is fixed. Move on."
Why it fails:
Fixing symptoms without understanding root causes means the same class of failure will recur.
Example:
- "The API call failed" → Fix: retry logic
- Root cause: No one reviewed error handling patterns → Next failure: different API, same missing error handling
FORGE approach:
Mandatory retrospectives on failures:
- What happened (symptoms)
- Root cause (why it happened)
- Who's accountable (not blame, but ownership)
- Prevention (process change, not just code fix)
Retrospectives are filed in persistent memory—they become institutional knowledge.
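The four retrospective fields map naturally onto a persisted record. A sketch, where the JSON-lines file layout is an assumption about how "persistent memory" is implemented:

```python
# Sketch: the four retrospective fields above as a persisted record.
# Storing one JSON object per line in retros.jsonl is an assumption.

import json
from dataclasses import dataclass, asdict

@dataclass
class Retrospective:
    symptoms: str      # what happened
    root_cause: str    # why it happened
    owner: str         # who is accountable (ownership, not blame)
    prevention: str    # process change, not just the code fix

def file_retro(retro: Retrospective, path: str = "retros.jsonl") -> None:
    """Append to persistent memory so the lesson outlives the incident."""
    with open(path, "a") as f:
        f.write(json.dumps(asdict(retro)) + "\n")
```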
Getting Started: 5 Steps to Implement FORGE
You don't need 40 agents and a complex swarm to use FORGE. You can implement it incrementally, starting with a single-agent system and growing as complexity demands.
Step 1: Write Your Charter (1-2 Hours)
Start with mission and boundaries.
Use the template from the Charter section:
- Mission (why this system exists)
- Principles (core values)
- Roles (even if it's just "one agent for now")
- Boundaries (what requires approval, what's prohibited)
- Accountability (how failures are handled)
Make it read-only for agents. Store it in a file (e.g., CHARTER.md) that agents must read every session but cannot modify without human approval.
Example for a solo developer:
```markdown
# My AI Assistant Charter

## Mission
Help me ship high-quality code faster without sacrificing security.

## Principles
- Test before ship
- Ask before external actions (emails, tweets, PRs to public repos)
- Security over speed

## Roles
- Me: Final decision-maker
- AI: Implements, self-reviews, proposes changes
- External review: GitHub PR review (when available)

## Boundaries
- ✅ Can: Draft code, run tests, search docs
- ⚠️ Approval: Push to main, deploy to production
- ❌ Never: Commit secrets, skip tests

## Accountability
- I review all code before merge
- AI writes summary of what changed and why
- Failures trigger retrospective (what happened, why, prevention)
```
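One way to enforce "read-only for agents" in practice is to pin the charter to a hash the human approved, and refuse to start a session if the file changed. A sketch, where the file paths and the approved-hash file are assumptions:

```python
# Sketch: pinning CHARTER.md to a human-approved hash. The path names and
# the .charter.sha256 sidecar file are illustrative assumptions.

import hashlib
from pathlib import Path

def charter_hash(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

def load_charter(path: str = "CHARTER.md",
                 approved: str = ".charter.sha256") -> str:
    """Read the charter; refuse to start a session if it was modified."""
    text = Path(path).read_text()
    expected = Path(approved).read_text().strip()
    if charter_hash(text) != expected:
        raise RuntimeError("CHARTER.md changed without human approval")
    return text
```

When the human edits the charter, they re-approve by writing the new hash to the sidecar file; agents never touch either file.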
Step 2: Implement the FORGE Cycle (Single Agent)
Even with one agent, run the four-stage cycle.
Before delivering any work output, the agent should:
- Reason: Understand the task fully (ask clarifying questions if needed)
- Act: Implement the solution
- Reflect: Self-review against the spec (checklist of success criteria)
- Verify: Run tests, confirm it works
Example prompt structure:
```
Before you deliver any code or solution:

1. REASON: Restate the task in your own words. List success criteria.
2. ACT: Implement the solution.
3. REFLECT: Self-review checklist:
   - Does this match the spec?
   - Are there edge cases I didn't handle?
   - Is this code readable and maintainable?
   - Did I test error conditions?
4. VERIFY: Run the build/tests. Confirm functionality.

Only after all four stages are complete: deliver the output.
```
This takes discipline, but it prevents the "ship first, fix later" trap.
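The same four stages can be driven programmatically rather than by prompt alone. A minimal sketch, assuming hypothetical `llm` and `run_tests` callables supplied by your runtime:

```python
# Sketch: the Reason -> Act -> Reflect -> Verify cycle as a driver loop.
# `llm` and `run_tests` are hypothetical hooks; the stage order is the point.

def forge_cycle(task: str, llm, run_tests, max_attempts: int = 3):
    # REASON: restate the task and pin down success criteria first.
    plan = llm(f"REASON: restate this task and list success criteria:\n{task}")
    for _ in range(max_attempts):
        # ACT: implement against the plan, not the raw task.
        work = llm(f"ACT: implement against this plan:\n{plan}")
        # REFLECT: self-review against the criteria before any runtime check.
        critique = llm(f"REFLECT: self-review against the criteria:\n{plan}\n{work}")
        # VERIFY: a runtime check, not an opinion -- only then deliver.
        if run_tests(work):
            return work
        plan += f"\nPrevious attempt failed verification: {critique}"
    raise RuntimeError(f"failed Verify after {max_attempts} attempts")
```

The loop only exits through `run_tests`, which is what keeps "the model said it looked good" from being the final word.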
Step 3: Add Independent Review (Two Agents)
When you're ready, add a second agent for review.
This could be:
- A dedicated QA agent (reviews functionality, tests, readability)
- A security-focused agent (reviews for vulnerabilities, credential leaks)
- A contrarian agent (challenges assumptions)
Key principle: The reviewer must not have seen the implementation reasoning. Run review in a fresh context without the builder's internal reasoning visible.
Example workflow:
- Builder agent: Runs FORGE Cycle, produces solution
- Orchestrator (you): Extracts just the output (code, docs) and spec
- Reviewer agent: Receives spec + output (not the builder's reasoning)
- Reviewer: Runs their own FORGE Cycle from review perspective
This simulates the "fresh eyes" principle without requiring human reviewers.
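The handoff boundary is the whole trick: only spec and output cross it. A sketch of the four-step workflow, with the builder/reviewer callables as hypothetical stand-ins for your agents:

```python
# Sketch: the fresh-eyes handoff above. Only spec + output cross the boundary;
# the builder's reasoning trace stays behind. Agent callables are hypothetical.

def build_and_review(spec: str, builder, reviewer) -> dict:
    result = builder(spec)                      # runs its own FORGE Cycle
    # Orchestrator extracts only the artifact -- never the reasoning trace.
    handoff = {"spec": spec, "output": result["output"]}
    review = reviewer(handoff)                  # fresh context, own FORGE Cycle
    return {"output": result["output"], "review": review}
```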
Step 4: Scale to Multi-Agent (When Complexity Demands It)
Don't prematurely add agents. Add them when you hit real limitations:
- Add Architect when: You're building something complex enough that design-before-implementation saves time
- Add QA when: You're repeatedly finding bugs post-deployment that review could have caught
- Add Security when: You're handling sensitive data, auth, or compliance requirements
- Add Domain Builders when: You need deep expertise (frontend vs. backend vs. DevOps)
Start small, grow as needed. A 3-agent system (Orchestrator + Builder + Reviewer) covers 80% of use cases.
Step 5: Instrument and Iterate
Track what matters:
- Task success rate (first-pass, after review, after Verify)
- Review findings (what categories of issues come up most?)
- Failure patterns (what root causes recur?)
- Token costs (per agent, per phase)
Use this data to improve:
- If QA repeatedly finds the same issue → update the Builder's constraints
- If Security repeatedly flags credentials → add automated secrets scanning
- If tasks fail Verify frequently → improve the Reason stage (clarify specs upfront)
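Spotting recurring categories is a counting problem. A minimal sketch, with the category strings as illustrative examples:

```python
# Sketch: counting review findings by category to spot recurring issues.
# Category names are illustrative.

from collections import Counter

def top_finding_categories(findings: list[str], n: int = 3) -> list[tuple[str, int]]:
    """Most frequent finding categories -- candidates for new Builder constraints."""
    return Counter(findings).most_common(n)
```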
FORGE is not static—it evolves with your system.
FORGE vs. Other Approaches
FORGE is not the only way to structure AI agent systems. Here's how it compares to popular alternatives.
1. Raw Prompt Chaining
What it is:
Sequential prompts where each prompt's output becomes the next prompt's input.
Example:
Prompt 1: "Research AI frameworks"
→ Output: "Here are 10 frameworks..."
Prompt 2: "Summarize the top 3"
→ Output: "LangGraph, CrewAI, AutoGen..."
Prompt 3: "Write a comparison table"
→ Output: [table]
Pros:
- ✅ Simple to implement
- ✅ No complex tooling required
- ✅ Easy to debug (each step is explicit)
Cons:
- ❌ No review/verification built in
- ❌ Single perspective (same model/reasoning chain)
- ❌ No error recovery (if step 2 fails, the chain breaks)
- ❌ No quality gates
When to use: One-off tasks, exploratory work, prototyping
FORGE difference: FORGE adds Reflect (multi-agent review) and Verify (runtime testing)—which raw chaining lacks entirely.
2. CrewAI Roles
What it is:
Role-based multi-agent framework where agents have titles (Researcher, Writer, Editor) and collaborate sequentially or hierarchically.
Example:
```python
researcher = Agent(role="Researcher", goal="Find data")
writer = Agent(role="Writer", goal="Draft article")
editor = Agent(role="Editor", goal="Polish final")

crew = Crew(agents=[researcher, writer, editor])
crew.kickoff()
```
Pros:
- ✅ Multi-agent out of the box
- ✅ Easy to map human roles to agents
- ✅ Built-in task handoffs
Cons:
- ❌ Sequential by default (a hierarchical process mode exists but is less mature)
- ❌ No quality gates enforced (review is just another agent role, not mandatory)
- ❌ No distinction between implementation and review (same agent can do both)
- ❌ No Charter or governance layer
When to use: Quick multi-agent prototypes, role-based workflows (customer service, content creation)
FORGE difference: FORGE enforces separation of roles (Orchestrator never implements, Builders never review themselves) and mandatory gates (both QA and Security must pass).
3. LangGraph Workflows
What it is:
Stateful graph-based orchestration where nodes represent agents/functions and edges represent control flow. Supports cycles, conditional branching, and human-in-the-loop.
Example:
```python
graph = StateGraph(State)  # State: your shared-state schema (e.g., a TypedDict)
graph.add_node("research", research_agent)
graph.add_node("analyze", analyze_agent)
graph.add_node("review", review_agent)
graph.add_edge("research", "analyze")
graph.add_conditional_edges("analyze", should_review,
                            {"review": "review", "done": "finalize"})
graph.add_edge("review", "research")  # Loop back if review fails
```
Pros:
- ✅ Flexible control flow (not just sequential)
- ✅ Stateful (agents share state across the graph)
- ✅ Production-grade (used by Klarna, Uber, LinkedIn)
- ✅ LangSmith integration (observability, tracing)
Cons:
- ❌ No governance framework (you build the workflow, but the structure of quality is up to you)
- ❌ No Charter or accountability layer
- ❌ No task sizing or adaptive depth
- ❌ Review is optional (not enforced)
When to use: Complex enterprise workflows, stateful multi-agent systems, teams with strong engineering
FORGE difference: FORGE provides workflow structure (Size → Inception → Construction → Gate → Ship) that LangGraph doesn't prescribe. You could implement FORGE on top of LangGraph—but LangGraph alone doesn't tell you when to run Architect vs. Builder vs. Reviewer.
4. OpenAI Swarm (Deprecated → Agents SDK)
What it is:
Lightweight pattern for agent-to-agent handoffs (now replaced by OpenAI Agents SDK).
Example (old Swarm):
```python
from swarm import Swarm, Agent

agent_a = Agent(name="A", functions=[handoff_to_b])
agent_b = Agent(name="B", functions=[handoff_to_a])

client = Swarm()
result = client.run(agent=agent_a,
                    messages=[{"role": "user", "content": "Start here"}])
```
Pros:
- ✅ Minimal abstraction (easy to understand)
- ✅ Handoff pattern is explicit
Cons:
- ❌ Experimental (not production-ready)
- ❌ Stateless (no shared context across handoffs)
- ❌ No governance, review, or quality gates
- ❌ Deprecated (replaced by Agents SDK)
When to use: Learning agent coordination concepts (not production)
FORGE difference: FORGE is a governance framework, not just a coordination pattern. Swarm/Agents SDK handles how agents talk; FORGE handles how agents ensure quality.
Comparison Table
| Aspect | Raw Chaining | CrewAI | LangGraph | FORGE |
|---|---|---|---|---|
| Multi-agent | No | Yes | Yes | Yes |
| Stateful | No | Limited | Yes | Yes |
| Review enforced | No | No | Optional | Mandatory |
| Security gate | No | No | Optional | Mandatory |
| Task sizing | No | No | No | Yes (Small/Medium/Large) |
| Charter/governance | No | No | No | Yes |
| Accountability | No | No | No | Yes (retrospectives) |
| Adaptive depth | No | No | No | Yes (Inception scales with complexity) |
| Observability | Manual | AMP Suite | LangSmith | Manual (roadmap: add tooling) |
Key insight: Most frameworks focus on orchestration mechanics. FORGE focuses on governance and verifiability. You can use FORGE with LangGraph or CrewAI—they're not mutually exclusive.
When to Use What
| If you need... | Use... |
|---|---|
| Quick prototype | Raw chaining or CrewAI |
| Complex stateful workflows | LangGraph |
| Role-based collaboration | CrewAI or FORGE |
| Governance + accountability | FORGE |
| Verifiable quality with multi-agent review | FORGE |
| Security-critical systems | FORGE (or FORGE + LangGraph for orchestration) |
| Minimal tooling, maximum simplicity | Raw chaining |
Conclusion: Why FORGE Matters
The AI agent market is exploding—but 40% of projects will fail. Not because the technology doesn't work, but because organizations deploy agents without governance, review, or accountability.
FORGE solves this.
It's not a tool or a library—it's a framework for how to think about AI agent work:
- Task sizing ensures effort matches complexity
- The Cycle (Reason → Act → Reflect → Verify) ensures every agent thinks before acting and verifies before shipping
- The Workflow (Inception → Construction → Gate → Ship) ensures design happens before implementation and review happens before deployment
- The Charter provides the governance foundation that makes accountability real
FORGE is governance-first AI: built for organizations that value trust and verifiability over "move fast and break things."
Next Steps
- Write your Charter — Start with mission, principles, and boundaries
- Implement the Cycle — Even with one agent, run Reason → Act → Reflect → Verify
- Add independent review — When ready, add a second agent for QA or Security
- Scale as needed — Add Architect, Builders, Reviewers when complexity demands
- Instrument and iterate — Track success rates, failure patterns, and evolve
FORGE grows with you. Start small, scale as needed, and always put governance first.
License
FORGE methodology documentation is released under the MIT License.
You are free to use, adapt, and build on FORGE for any purpose — commercial or otherwise — with attribution.
About Bamwerks
FORGE was developed by Bamwerks, a 40-agent AI organization serving Brandt Meyers (Founder & President). Bamwerks runs on FORGE principles with a strict Charter, multi-agent review on all software development, and mandatory retrospectives on failures.
Learn more:
- Bamwerks Charter — Our governance foundation
- Agent Roster — Meet the 40-agent swarm
Framework version: 1.0
Last updated: February 26, 2026
License: MIT License
"Governance is not a constraint—it's what makes autonomy trustworthy."