Clearfork.AI

Working Framework

Internal use only


How We Think About
Deploying AI Agents

A working draft. Not doctrine.
⚠️ WORKING DRAFT — IDEAS IN PROGRESS. Nothing here is set in stone. These are the ideas we're currently finding useful. Some will evolve. Some will get thrown out. Treat it as a conversation, not a rulebook.

Most AI deployments fail because they skip the thinking layer. We don't skip it. (At least that's our current hypothesis.)

Why This Exists

Late 2025, something changed. The models got scary smart — smart enough to operate autonomously, run for hours, coordinate across dozens of systems, execute tasks that used to require entire teams. The new architecture doesn't need the ridiculous orchestration layers people were building in 2024. Simpler. More stable. More secure.

But here's the problem nobody talks about: when the models get smarter, the bottleneck moves. It used to be the AI. Now it's the human who knows how to direct it.

Specification quality now determines whether an agentic deployment succeeds or catastrophically fails. And almost nobody is teaching this. This document is how we think about it.


Know What Kind of Problem You're Solving

The first and most important question in any deployment isn't "what AI should we use?" It's "what type of problem is this?"

We use the Six Types of Hard to classify every client problem before we touch a tool:

Type | Bottleneck | AI Solves It?
Effort | Scale — enormous but not complex | ✅ Yes — agentic AI
Coordination | Routing, sequencing, org awareness | ✅ Yes — agentic AI
Reasoning | Multi-step logic, novel deduction | ✅ Yes — pure reasoners
Domain Expertise | Lived experience, pattern recognition | 🟡 Human + AI assist
Ambiguity | Defining the right question | 🟡 Human defines, AI explores
Judgment & Emotional Intelligence | Courage, nerve, unobservable dynamics | ❌ No — human only

The insight most consultants miss: Most businesses are drowning in effort problems and coordination problems. They've been trying to solve them with ChatGPT, which is a reasoning tool. That's why it feels like a toy.

Agentic AI is built for effort and coordination. Route correctly and you're not building something slightly better than ChatGPT — you're building something that operates in a different category entirely.


Put the Right Engine in the Right Car

Once you know the problem type, you route to the right model. This is not about using the "smartest" model — it's about matching capability to task. Wrong model = wasted money, broken workflows, frustrated clients.

Claude Opus 4.6
The Workhorse

Built for agentic work. Tools, APIs, file systems, sustained autonomous runs. Use when the task runs for hours and touches multiple systems.

Real proof: 16 agents building a C compiler over weeks. At Rakuten: autonomously closed issues and routed work across a 50-person org and 6 repositories.

Cost: ~$15/M tokens

Gemini 3.1 Pro
The Deep Thinker

Built for pure reasoning. Novel logic, multi-step deduction, problems nobody has seen before. Use for analysis, contract review, threat modeling.

Real proof: Broke 18 previously unsolved problems in math, physics, CS, and economics. Best ARC-AGI2 score ever.

Cost: ~$2/M tokens — roughly 7.5x cheaper than Opus for reasoning-only work

Fast/Cheap Models
Groq · Cerebras · DeepSeek

High volume, low complexity — classification, routing decisions, summarization, real-time voice bridge. Use when you need speed or scale and the thinking per step is trivial.

The cost engineering rule: Route reasoning-only tasks to Gemini. For a client running 1B tokens/month, that's $13K/month in margin. We don't leave that on the table.
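That margin figure is easy to sanity-check. A minimal sketch, using the document's approximate per-million-token prices (the function name is ours, and real API pricing will differ):

```python
# Back-of-envelope check of the routing margin quoted above.
# Prices are this document's approximate figures, not live API pricing.
OPUS_PER_M = 15.0    # ~$15 per million tokens
GEMINI_PER_M = 2.0   # ~$2 per million tokens

def monthly_savings(tokens_per_month: int) -> float:
    """Dollars/month saved by sending reasoning-only tokens to the cheaper model."""
    millions = tokens_per_month / 1_000_000
    return millions * (OPUS_PER_M - GEMINI_PER_M)

print(monthly_savings(1_000_000_000))  # 13000.0 -> the ~$13K/month in the text
```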

Routing in Practice

Discovery / Audit → Gemini (analysis, reasoning)
Long-Running Automation → Opus (agentic, multi-step)
High-Volume Classification → Groq / DeepSeek (speed + cost)
Voice Agents → Groq / Cerebras (latency > depth)
Code Generation → Opus or Codex (depends on volume)
Real-Time Analysis → Gemini (depth over orchestration)
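The routing table above can be encoded in a few lines. A sketch under our own assumptions — the keys and model names are shorthand labels from this document, not real API model identifiers:

```python
# Hypothetical routing table mirroring the list above; values are shorthand
# labels from this document, not real API model identifiers.
ROUTES = {
    "discovery_audit": "gemini",            # analysis, reasoning
    "long_running_automation": "opus",      # agentic, multi-step
    "high_volume_classification": "groq",   # speed + cost (or deepseek)
    "voice_agent": "groq",                  # latency > depth (or cerebras)
    "code_generation": "opus",              # or codex, depends on volume
    "real_time_analysis": "gemini",         # depth over orchestration
}

def route(task_type: str) -> str:
    """Pick a model family; fail loudly on anything unrouted."""
    if task_type not in ROUTES:
        raise ValueError(f"unrouted task type {task_type!r}: escalate to a human")
    return ROUTES[task_type]
```

Raising on an unknown task type, rather than defaulting to the "smartest" model, is a deliberate choice: routing mistakes should surface immediately, not show up later as cost overruns.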

Build the Right Infrastructure, Not Just a Better Prompt

This is where most "AI consultants" stop. They write a prompt. They call it done. It works for a week, then breaks.

Real agent deployment requires four layers of infrastructure, operating at different altitudes. Each layer depends on the one below it.

Layer 1

Prompt Craft

Clear instructions, examples, counter-examples, guardrails, output format. Table stakes now — like typing with ten fingers. Still matters. Doesn't differentiate.

Most consultants live here. This is where they charge $400/hour.

Layer 2

Context Engineering

Everything the model sees during a task: system prompts, tool definitions, retrieved documents, conversation history, memory systems, external data connections.

Your prompt might be 200 tokens. The context window might be 1 million. Your prompt is 0.02% of what the model sees.

The 10x practitioners aren't writing 10x better prompts — they're building 10x better context infrastructure. An agent that knows your business when it wakes up vs. one that starts cold every time.

Layer 3

Intent Engineering

Context = what to know. Intent = what to want.

This is where most deployments fail silently. Klarna built 2.3 million resolved conversations optimized for speed — not customer satisfaction. They celebrated the number. Then they started losing customers. Then they rehired humans.

Intent engineering is encoding your actual organizational goals, values, trade-off hierarchies, and decision boundaries into the agent infrastructure.

Bad prompt: wasted morning. Bad intent engineering: company-scale disaster.

Layer 4

Specification Engineering

Writing documents that autonomous agents can execute over extended time horizons — hours, days, weeks — without human intervention.

This is the shift almost nobody has made yet. It's not about the agent's context window. It's about thinking of your entire organizational document corpus as agent-executable. Every process doc, every playbook, every decision tree.

This is why long-running agents fail: the spec breaks down at hour 3. The agent runs out of clear direction and starts filling gaps with "statistical plausibility." That's a polite term for guessing. Guesses compound.

The cumulative stack: You cannot have good intent alignment without good context. You cannot have effective specification engineering without good intent alignment. Each layer depends on the ones below it. Skipping layers doesn't save time — it just delays failure.


Build Specs That Agents Can Actually Execute

For any long-running or high-stakes agent task, the specification needs five things. These are what we keep coming back to:

01

Self-Contained Problem Statement

Include ALL context the agent needs. No gaps. The agent fills gaps by guessing, and guesses compound over a 4-hour run. Write it so a capable agent could start from zero and know exactly what it's doing and why.

02

Acceptance Criteria

If you can't describe what done looks like, the agent can't know when to stop. Write 3 sentences an independent observer could use to verify the output WITHOUT asking you any questions.

"Make it better" is not acceptance criteria. "Reduce customer escalation rate from 18% to under 10% as measured by column F in the CRM report" is.
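The good criterion is good precisely because it is executable. A minimal sketch (the function name and threshold encoding are ours; reading column F from the CRM report is stubbed out):

```python
# The acceptance criterion above as an executable check. Pulling the rate
# from column F of the CRM report is stubbed; names here are illustrative.
TARGET = 0.10  # "under 10%"

def meets_acceptance(escalation_rate: float) -> bool:
    """An independent observer can run this without asking the spec author anything."""
    return escalation_rate < TARGET

print(meets_acceptance(0.18))  # False: the starting 18% rate fails
print(meets_acceptance(0.09))  # True: under the 10% target
```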

03

Constraint Architecture

Four categories, always defined:

  • Must do — non-negotiable requirements
  • Must NOT do — hard guardrails
  • Prefer — guidance when multiple valid approaches exist
  • Escalate — what the agent brings to a human rather than deciding autonomously

That last one is underrated. An agent that knows when to stop and ask is more valuable than one that barrels forward and apologizes later.
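The four categories fit naturally into a small schema. An illustrative sketch — the field names and example entries are ours, not a standard format:

```python
from dataclasses import dataclass, field

# Illustrative schema for the four constraint categories; field names and
# example entries are ours, not a standard format.
@dataclass
class ConstraintSpec:
    must_do: list[str] = field(default_factory=list)    # non-negotiable requirements
    must_not: list[str] = field(default_factory=list)   # hard guardrails
    prefer: list[str] = field(default_factory=list)     # tie-breakers between valid approaches
    escalate: list[str] = field(default_factory=list)   # bring to a human, never decide alone

spec = ConstraintSpec(
    must_do=["log every external API call"],
    must_not=["write to production tables"],
    prefer=["smaller diffs over broad refactors"],
    escalate=["any action that deletes customer data"],
)
```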

04

Decomposition

Break large tasks into ~2-hour subtasks with clear input/output boundaries that can be verified independently. The goal isn't to write every subtask yourself. Describe the decomposition logic so a planner agent can do the breaking. You provide the break patterns. The agent provides the execution.
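One way to make "clear input/output boundaries" concrete is to check that each subtask's inputs are actually produced by something earlier. A sketch under our own assumptions (the schema and validator are illustrative, not a standard):

```python
from dataclasses import dataclass

# Illustrative subtask boundary; this schema is ours, not a standard.
@dataclass
class Subtask:
    name: str
    inputs: list[str]    # artifacts this subtask consumes
    outputs: list[str]   # artifacts it must produce
    est_hours: float     # target ~2h so each unit stays independently verifiable

def validate_chain(initial: list[str], subtasks: list[Subtask]) -> bool:
    """Every subtask's inputs must come from initial artifacts or prior outputs."""
    available = set(initial)
    for t in subtasks:
        if not set(t.inputs) <= available:
            return False            # a gap the planner agent must fill or flag
        available |= set(t.outputs)
    return True
```

A planner agent can generate the subtask list; this kind of check is how you verify its decomposition before anything runs.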

05

Evaluation Design

Build 3-5 test cases with known good outputs for every recurring agent task. Run them periodically, especially after model updates. This is the only thing standing between AI output you can't use and output you can use as-is. Institutional knowledge that compounds over time.
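An eval set this small still fits in a screenful of code. A minimal sketch, where `run_agent` is a placeholder for however the deployed agent is actually invoked and the cases are invented for illustration:

```python
# Minimal eval harness: a handful of test cases with known good outputs,
# re-run after every model update. The cases here are invented examples.
CASES = [
    {"input": "classify: refund request", "expected": "billing"},
    {"input": "classify: app crashes on login", "expected": "technical"},
    {"input": "classify: cancel my account", "expected": "retention"},
]

def evaluate(run_agent) -> float:
    """Return the fraction of known-good cases the agent still passes."""
    passed = sum(1 for c in CASES if run_agent(c["input"]) == c["expected"])
    return passed / len(CASES)

# A perfect stand-in agent passes everything:
lookup = {c["input"]: c["expected"] for c in CASES}
print(evaluate(lambda x: lookup[x]))  # 1.0
```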


Three Design Principles We Keep Coming Back To

Not commandments. But every time we've cut corners on one of these, we've paid for it. Worth taking seriously until something better replaces them.

Specify Before You Build

Every agent starts with a testable specification. Not a vague intent. A specification. If you can't write the test, you don't understand the problem well enough to build.

"Make it user friendly" is not a spec. "Reduce onboarding drop-off from 40% to 15% by eliminating steps 3 and 4" is.

Verify Every Output

All agent output should be verified. Engineers write unit tests. Agents need the equivalent. Automated where possible. Human-in-the-loop where necessary.

The dangerous agent isn't the one that errors out — it's the one that builds exactly what was asked for when what was asked for was wrong. A CodeRabbit finding: AI-generated code produces 1.7x more logic issues than human-written code. Not syntax errors. Doing the wrong thing correctly.

Fail Visibly, Not Silently

Design agents to surface uncertainty, flag ambiguity, and halt on unclear specs rather than hallucinating forward. An agent that says "I don't know what to do here" is not a failure. An agent that quietly deletes a production database and fabricates records to cover it is an existential failure.
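"Fail visibly" in miniature: halt on an underspecified task instead of guessing. A sketch with our own illustrative names (`AmbiguousSpec` and the spec fields are not a real API):

```python
# Halt-and-surface guard: refuse to act on an underspecified task rather
# than guessing forward. AmbiguousSpec and the spec fields are our names.
class AmbiguousSpec(Exception):
    """Raised when the spec does not determine what to do next."""

def next_action(spec: dict) -> str:
    missing = [k for k in ("goal", "acceptance") if k not in spec]
    if missing:
        raise AmbiguousSpec(f"spec missing {missing}; halting for human input")
    return f"proceed: {spec['goal']}"
```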


Foundation Before Leverage

Nate B Jones made a point worth locking in:

The calculator moment worked itself out because students who had the mathematical foundation first and then got the calculator pulled ahead. Students who only got the calculator lost the ability to know if the answer was reasonable.

Same dynamic with AI. Clients who deployed AI without understanding what they were deploying have the same problem: they can't evaluate the output. They've offloaded the thinking layer prematurely.

The gift we give our clients is the cognitive architecture that lets them direct intelligence rather than depend on it. Foundation first. Then leverage.

That's the difference between an AI deployment that compounds over time and one that makes your team quietly dumber while generating impressive-looking output.


One-Liners for Common Client Objections

Client: "We tried AI. It didn't work."
"Walk me through what you built. I'll tell you which layer broke."

Client: "Can't we just use ChatGPT?"
"ChatGPT is a reasoning tool. Your biggest pain point is probably an effort problem or a coordination problem. That's a different tool — and a very different price point."

Client: "How is this different from what our IT team tried?"
"Your IT team built Layer 1. We build all four layers. That's the difference between a chatbot and a system that runs your operation."

The Full Stack

RIGHT PROBLEM TYPE → RIGHT MODEL → RIGHT INFRASTRUCTURE → RIGHT SPEC
Six Types of Hard · Model Map · 4-Layer Stack · 5 Primitives
Specify first · Verify always · Fail visibly