How We Think About
Deploying AI Agents
Most AI deployments fail because they skip the thinking layer. We don't skip it. (At least that's our current hypothesis.)
Why This Exists
In late 2025, something changed. The models got scary smart — smart enough to operate autonomously, run for hours, coordinate across dozens of systems, and execute tasks that used to require entire teams. The new architecture doesn't need the ridiculous orchestration layers people were building in 2024. Simpler. More stable. More secure.
But here's the problem nobody talks about: when the models get smarter, the bottleneck moves. It used to be the AI. Now it's the human who knows how to direct it.
Specification quality now determines whether an agentic deployment succeeds or catastrophically fails. And almost nobody is teaching this. This document is how we think about it.
Know What Kind of Problem You're Solving
The first and most important question in any deployment isn't "what AI should we use?" It's "what type of problem is this?"
We use the Six Types of Hard to classify every client problem before we touch a tool:
| Type | Bottleneck | AI Solves It? |
|---|---|---|
| Effort | Scale — enormous but not complex | ✅ Yes — agentic AI |
| Coordination | Routing, sequencing, org awareness | ✅ Yes — agentic AI |
| Reasoning | Multi-step logic, novel deduction | ✅ Yes — pure reasoners |
| Domain Expertise | Lived experience, pattern recognition | 🟡 Human + AI assist |
| Ambiguity | Defining the right question | 🟡 Human defines, AI explores |
| Judgment & Emotional Intelligence | Courage, nerve, unobservable dynamics | ❌ No — human only |
The insight most consultants miss: Most businesses are drowning in effort problems and coordination problems. They've been trying to solve them with ChatGPT, which is a reasoning tool. That's why it feels like a toy.
Agentic AI is built for effort and coordination. Route correctly and you're not building something slightly better than ChatGPT — you're building something that operates in a different category entirely.
Put the Right Engine in the Right Car
Once you know the problem type, you route to the right model. This is not about using the "smartest" model — it's about matching capability to task. Wrong model = wasted money, broken workflows, frustrated clients.
Opus — built for agentic work. Tools, APIs, file systems, sustained autonomous runs. Use when the task runs for hours and touches multiple systems.
Real proof: 16 agents building a C compiler over weeks, and Rakuten autonomously closing issues and routing work across a 50-person org and 6 repositories.
Cost: ~$15/M tokens
Gemini — built for pure reasoning. Novel logic, multi-step deduction, problems nobody has seen before. Use for analysis, contract review, threat modeling.
Real proof: broke 18 previously unsolved problems in math, physics, CS, and economics. Best ARC-AGI-2 score ever.
Cost: ~$2/M tokens — 7x cheaper than Opus for reasoning-only work
Small, fast models — high volume, low complexity: classification, routing decisions, summarization, real-time voice bridging. Use when you need speed or scale and the thinking per step is trivial.
The cost engineering rule: Route reasoning-only tasks to Gemini. For a client running 1B tokens/month, that's $13K/month in margin. We don't leave that on the table.
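The margin math in the cost engineering rule can be sketched in a few lines. The per-million-token prices are the approximate figures quoted above, not official pricing:

```python
# Approximate per-million-token prices quoted in this document (not official pricing).
OPUS_PER_M = 15.0    # ~$15/M tokens, agentic model
GEMINI_PER_M = 2.0   # ~$2/M tokens, reasoning model

def monthly_margin(tokens_per_month: int) -> float:
    """Dollars saved per month by routing reasoning-only work to the cheaper model."""
    millions = tokens_per_month / 1_000_000
    return millions * (OPUS_PER_M - GEMINI_PER_M)

print(monthly_margin(1_000_000_000))  # 1B tokens/month -> 13000.0
```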
Routing in Practice
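A minimal sketch of routing in practice, mapping the Six Types of Hard from the table above to tooling categories. The category names and the `route` function are illustrative, not a real API:

```python
# Routing table mirroring the Six Types of Hard. Category names are hypothetical.
ROUTES = {
    "effort": "agentic_ai",
    "coordination": "agentic_ai",
    "reasoning": "pure_reasoner",
    "domain_expertise": "human_plus_ai_assist",
    "ambiguity": "human_defines_ai_explores",
    "judgment": "human_only",
}

def route(problem_type: str) -> str:
    """Return the tooling category for a classified problem; fail loudly on unknowns."""
    try:
        return ROUTES[problem_type]
    except KeyError:
        raise ValueError(
            f"Unclassified problem type: {problem_type!r} — classify before touching a tool"
        )

print(route("effort"))  # -> agentic_ai
```

Note the failure mode: an unclassified problem raises instead of silently defaulting to the nearest tool, which is the whole point of classifying first.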
Build the Right Infrastructure, Not Just a Better Prompt
This is where most "AI consultants" stop. They write a prompt. They call it done. It works for a week, then breaks.
Real agent deployment requires four layers of infrastructure, operating at different altitudes. Each layer depends on the one below it.
Prompt Craft
Clear instructions, examples, counter-examples, guardrails, output format. Table stakes now — like typing with ten fingers. Still matters. Doesn't differentiate.
Most consultants live here. This is where they charge $400/hour.
Context Engineering
Everything the model sees during a task: system prompts, tool definitions, retrieved documents, conversation history, memory systems, external data connections.
Your prompt might be 200 tokens. The context window might be 1 million. Your prompt is 0.02% of what the model sees.
The 10x practitioners aren't writing 10x better prompts — they're building 10x better context infrastructure. An agent that knows your business when it wakes up vs. one that starts cold every time.
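The prompt-to-context ratio above, as plain arithmetic:

```python
def context_share(prompt_tokens: int, total_context_tokens: int) -> float:
    """Fraction of the model's view that the prompt itself occupies."""
    return prompt_tokens / total_context_tokens

# A 200-token prompt in a 1M-token context window:
print(f"{context_share(200, 1_000_000):.2%}")  # -> 0.02%
```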
Intent Engineering
Context = what to know. Intent = what to want.
This is where most deployments fail silently. Klarna optimized its assistant for speed across 2.3 million resolved conversations — not for customer satisfaction. They celebrated the number. Then they started losing customers. Then they rehired humans.
Intent engineering is encoding your actual organizational goals, values, trade-off hierarchies, and decision boundaries into the agent infrastructure.
Bad prompt: wasted morning. Bad intent engineering: company-scale disaster.
Specification Engineering
Writing documents that autonomous agents can execute over extended time horizons — hours, days, weeks — without human intervention.
This is the shift almost nobody has made yet. It's not about the agent's context window. It's about thinking of your entire organizational document corpus as agent-executable. Every process doc, every playbook, every decision tree.
This is why long-running agents fail: the spec breaks down at hour 3. The agent runs out of clear direction and starts filling gaps with "statistical plausibility." That's a polite term for guessing. Guesses compound.
The cumulative stack: You cannot have good intent alignment without good context. You cannot have effective specification engineering without good intent alignment. Each layer depends on the ones below it. Skipping layers doesn't save time — it just delays failure.
Build Specs That Agents Can Actually Execute
For any long-running or high-stakes agent task, the specification needs five things. These are what we keep coming back to:
Self-Contained Problem Statement
Include ALL context the agent needs. No gaps. The model fills gaps with statistical plausibility, and over a 4-hour run those guesses compound. Write it so a capable agent could start from zero and know exactly what it's doing and why.
Acceptance Criteria
If you can't describe what done looks like, the agent can't know when to stop. Write 3 sentences an independent observer could use to verify the output WITHOUT asking you any questions.
"Make it better" is not acceptance criteria. "Reduce customer escalation rate from 18% to under 10% as measured by column F in the CRM report" is.
Constraint Architecture
Four categories, always defined:
- Must do — non-negotiable requirements
- Must NOT do — hard guardrails
- Prefer — guidance when multiple valid approaches exist
- Escalate — what the agent brings to a human rather than deciding autonomously
That last one is underrated. An agent that knows when to stop and ask is more valuable than one that barrels forward and apologizes later.
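The four categories can be carried as a data structure an agent harness enforces before each action. A minimal sketch; the field names and the keyword-matching escalation check are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class Constraints:
    must_do: list[str] = field(default_factory=list)      # non-negotiable requirements
    must_not_do: list[str] = field(default_factory=list)  # hard guardrails
    prefer: list[str] = field(default_factory=list)       # tie-breakers between valid approaches
    escalate: list[str] = field(default_factory=list)     # decisions reserved for a human

def needs_human(action: str, c: Constraints) -> bool:
    """The 'stop and ask' check: does this action touch an escalation trigger?"""
    return any(trigger in action for trigger in c.escalate)

c = Constraints(
    must_do=["log every external API call"],
    must_not_do=["write to production tables"],
    escalate=["refund", "delete"],
)
print(needs_human("issue a refund to customer 4412", c))  # -> True
```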
Decomposition
Break large tasks into ~2-hour subtasks with clear input/output boundaries that can be verified independently. The goal isn't to write every subtask yourself. Describe the decomposition logic so a planner agent can do the breaking. You provide the break patterns. The agent provides the execution.
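What "describe the decomposition logic" can look like in practice: you encode the break pattern, and a planner applies it. A hypothetical sketch for a table-migration task, with each subtask carrying verifiable input/output boundaries:

```python
from dataclasses import dataclass

@dataclass
class Subtask:
    name: str
    inputs: list[str]   # what this step consumes
    outputs: list[str]  # what an independent check can verify

def decompose_migration(tables: list[str]) -> list[Subtask]:
    """Break pattern: one verifiable subtask per table, then a final reconciliation."""
    steps = [
        Subtask(f"migrate_{t}", inputs=[f"{t}_source_dump"], outputs=[f"{t}_row_count_match"])
        for t in tables
    ]
    steps.append(Subtask(
        "reconcile",
        inputs=[f"{t}_row_count_match" for t in tables],
        outputs=["audit_report"],
    ))
    return steps

plan = decompose_migration(["users", "orders"])
print([s.name for s in plan])  # -> ['migrate_users', 'migrate_orders', 'reconcile']
```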
Evaluation Design
Build 3-5 test cases with known good outputs for every recurring agent task. Run them periodically, especially after model updates. This is the only thing standing between AI output you can't use and output you can use as-is. Institutional knowledge that compounds over time.
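A skeletal version of that evaluation harness, assuming a recurring classification task. `run_agent` is a stub standing in for whatever actually invokes the model, so the harness itself is runnable:

```python
# Golden cases: known inputs with known good outputs, re-run after model updates.
GOLDEN_CASES = [
    {"input": "classify: refund request",       "expected": "billing"},
    {"input": "classify: password reset",       "expected": "auth"},
    {"input": "classify: app crashes on login", "expected": "bug"},
]

def run_agent(task_input: str) -> str:
    # Stub: a real deployment would call the model here.
    keywords = {"refund": "billing", "password": "auth", "crash": "bug"}
    return next((v for k, v in keywords.items() if k in task_input), "unknown")

def run_evals() -> list[str]:
    """Return the inputs that fail; empty means the task still passes its goldens."""
    return [c["input"] for c in GOLDEN_CASES if run_agent(c["input"]) != c["expected"]]

print(run_evals())  # -> [] while every golden case still passes
```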
Three Design Principles We Keep Coming Back To
Not commandments. But every time we've cut corners on one of these, we've paid for it. Worth taking seriously until something better replaces them.
Specify Before You Build
Every agent starts with a testable specification. Not a vague intent. A specification. If you can't write the test, you don't understand the problem well enough to build.
"Make it user friendly" is not a spec. "Reduce onboarding drop-off from 40% to 15% by eliminating steps 3 and 4" is.
Verify Every Output
All agent output should be verified. Engineers write unit tests. Agents need the equivalent. Automated where possible. Human-in-the-loop where necessary.
The dangerous agent isn't the one that errors out — it's the one that builds exactly what was asked for when what was asked for was wrong. CodeRabbit finding: AI-generated code produces 1.7x more logic issues than human code. Not syntax errors. Doing the wrong thing correctly.
Fail Visibly, Not Silently
Design agents to surface uncertainty, flag ambiguity, and halt on unclear specs rather than hallucinating forward. An agent that says "I don't know what to do here" is not a failure. An agent that quietly deletes a production database and fabricates records to cover it is an existential failure.
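One way to build that behavior in: halting on an underspecified task is a raised exception, not a quietly invented answer. The exception name and spec shape are illustrative:

```python
class AmbiguousSpec(Exception):
    """Raised instead of hallucinating forward on an unclear spec."""

def execute_step(spec: dict) -> str:
    if "acceptance_criteria" not in spec:
        # Fail visibly: surface the gap rather than guess what 'done' means.
        raise AmbiguousSpec("No acceptance criteria for task: halting and escalating")
    return f"executing: {spec['task']}"

try:
    execute_step({"task": "migrate user table"})
except AmbiguousSpec as e:
    print(f"ESCALATED: {e}")  # surfaced to a human, not papered over
```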
Foundation Before Leverage
Nate B Jones made a point worth locking in:
The calculator moment worked itself out because students who had the mathematical foundation first and then got the calculator pulled ahead. Students who only got the calculator lost the ability to know if the answer was reasonable.
Same dynamic with AI. Clients who deployed AI without understanding what they were deploying have the same problem: they can't evaluate the output. They've offloaded the thinking layer prematurely.
The gift we give our clients is the cognitive architecture that lets them direct intelligence rather than depend on it. Foundation first. Then leverage.
That's the difference between an AI deployment that compounds over time and one that makes your team quietly dumber while generating impressive-looking output.