The Context Problem with Agentic AI
Why AI-Assisted Development Needs Curated Context at Scale
It’s 2026 now, and the software engineering industry is in the midst of a (not so) quiet AI revolution. The latest focus is Agentic AI - systems designed to assist, reason, plan, and execute on increasingly complex tasks with minimal human intervention. These systems often go beyond following instructions and make architectural decisions through deep reasoning. They’re being heralded as the future of development: AI that “just figures it out”, removing humans from the tedious parts of software creation.
I think this approach is undeniably powerful in certain circumstances, but there’s an issue with how broadly it’s being applied. Inference-heavy systems and higher levels of automation are increasingly being treated as a near-complete solution to software complexity, but they simply aren’t.
If leveraged too heavily in the wrong situations, agentic AI will create far more technical debt and organizational dysfunction than any collection of human engineers could create alone.
This isn’t a critique of model capability. Agentic systems are powerful and most certainly have their place. The issue is which models are being used where, and how broadly inference-heavy approaches are being applied.
I believe there’s a more durable path forward - one that keeps the upside of AI assistance without taking on much of the risk of boundless autonomy.
To find a solution, however, we first need to diagnose the problem.
The Allure of Agency
Agentic engineering is simple in concept: delegate multiple steps of the development process to AI systems, reducing the need for continuous human intervention. Tools like Claude Code (utilizing Claude’s family of models) exemplify this approach. They generate code, reason about architecture, infer intent, and propagate decisions across files and systems.
And in isolation? It’s phenomenal.
Think of it like having a senior engineer in the room: opinionated, experienced, willing to think outside the box and challenge your assumptions. When you’re prototyping, experimenting, or building something solo, this is exactly what you want. The AI narrates its reasoning beautifully, fills in gaps you didn’t know existed, and moves fast.
For early-stage startups, this is often the correct choice. Speed matters more than formalism. Ambiguity is cheap. The same person who writes the context usually consumes the output. Assumption-heavy systems are incredibly valuable in this phase.
The problem emerges when this same approach scales beyond its natural boundaries. This “senior engineer” is making architectural decisions behind the scenes - decisions you didn’t explicitly authorize, decisions that aren’t necessarily documented in shared artifacts, decisions that will eventually multiply across teams and timescales.
What works brilliantly in a single-team, fast-feedback environment will likely become a liability when context must survive handoffs, turnover, and detailed inspection.
The Hidden Cost of Inference
Imagine multiple teams working on loosely related parts of the same system. Each uses agentic tooling to make architectural and implementation decisions in real time. Locally, progress looks strong. Decisions may even be documented within team boundaries.
But here’s the problem: what’s missing is the reasoning path that led there.
Much of what happened, and how each team arrived at its conclusions, is not actually documented in shared artifacts. Even when it is, tools that rely heavily on deep reasoning still tend to draw different conclusions from partial or evolving context.
To clarify - when I say “context” here, I don’t just mean your code files or chat history. I mean the invisible constraints - the business domain knowledge not normally found in a codebase, the architectural non-negotiables, and the “why” behind previous decisions. When this information isn’t explicit, the model infers more, and different models tend to infer differently by default.
Ask yourself this litmus test:
Can two teams, six months apart, with partial context, reach similar outcomes and explain why?
I suspect assumption-heavy systems will fail this test quietly.
Here’s the dangerous part: the divergence won’t show up immediately. Codebases drift. Architectural patterns fragment. Tradeoff decisions vary across teams. Integration problems appear long after the decisions that caused them.
This tech debt will be invisible, until it’s not.
Many creative inferences become undocumented architectural decisions. Many “reasoned” choices become potential divergence points. Why? Because the reasoning happens behind the scenes (inside the model, not in your artifacts), so you can’t inspect it, reproduce it, or trust it across contexts.
In addition, there’s also the cost factor that businesses need to consider: deep reasoning is expensive. Take Claude Opus, for example - it prioritizes sophisticated inference, but consumes significantly more tokens per interaction than more constrained approaches. While highly valuable in the earlier stages of development, it will eat away at your bottom line if used beyond its intended scope.
To top it off, the total cost of ownership for agentic AI at scale includes far more than just the token bill - it also includes the hidden labor cost of reconciling divergent outputs, which can easily dwarf the API expenses.
The Phase Change: From Prototyping to Production
Now, this isn’t a moral judgment about agentic AI. I’m simply looking at a phase change in how systems scale.
Early stage:
Small teams
Shared mental models
Informal context
Fast feedback
Limited blast radius
Assumptions help here - awesome 👍
Organizational scale:
Many teams
Partial context
Artifact reuse across time and people
Asynchronous consumption
Accountability and governance
Assumptions become a risk - no thanks 👎
Nothing about this transition implies weaker engineers or worse models - it’s the environment itself that’s different.
At the early stage, you want models to infer, challenge, and fill in gaps. At organizational scale, you’ll need models to respect boundaries, follow declared rules, and produce more consistent outputs.
Production failures almost always trace back to misaligned assumptions, not a lack of intelligence.
What Production Systems Generally Need
Once requirements and constraints are clear, production systems tend to need something different: compliance over creativity. Predictability over possibility. Executional correctness over speculative reasoning.
Literalist, rule-following models quietly win here - provided the rules are actually clear.
Inference-heavy models tend to interpret ambiguity, challenge assumptions, and fill gaps. They behave like thoughtful collaborators when problems are underspecified.
Constraint-driven models tend to behave differently in day-to-day engineering use. They behave more like the lawyer of LLMs - rigid, precise, and far less willing to reinterpret intent once the rules are set. As of writing, I’m finding Gemini Flash to be the clearest example of this archetype, though I recognize that model behaviors are fluid and likely to shift over time.
Both are valid design philosophies. The optimal solution depends entirely on the scenario in front of you.
For budget-conscious businesses (which is most), this matters. If you’re running hundreds or thousands of AI-assisted interactions per day across your engineering team, the cost difference between inference-heavy and constraint-driven approaches compounds quickly.
Training engineers to toggle based on task type (expensive inference for ambiguity, cheaper execution for defined work) can dramatically reduce AI spend without sacrificing quality.
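As a rough sketch of what that toggle could look like once codified, here’s a minimal Python routing helper. The task categories and model identifiers are placeholders of my own, not real API values or recommendations:

    from enum import Enum, auto

    class TaskType(Enum):
        EXPLORATION = auto()  # ambiguous requirements, open design questions
        EXECUTION = auto()    # well-specified work against known constraints

    # Hypothetical model identifiers - substitute whatever your provider actually offers.
    INFERENCE_HEAVY_MODEL = "deep-reasoning-model"
    CONSTRAINT_DRIVEN_MODEL = "constraint-driven-model"

    def select_model(task_type: TaskType) -> str:
        """Route ambiguous work to expensive inference, defined work to cheap execution."""
        if task_type is TaskType.EXPLORATION:
            return INFERENCE_HEAVY_MODEL
        return CONSTRAINT_DRIVEN_MODEL

    # A well-specified refactor goes to the cheaper, literal model.
    print(select_model(TaskType.EXECUTION))  # -> "constraint-driven-model"

Even something this simple makes the routing decision explicit and reviewable, rather than leaving it to each engineer’s habits.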
It’s also worth noting that a lot of recent work on context engineering is moving in this direction. The focus seems to be on reducing token usage, managing agent state, and making deep reasoning cheaper and more efficient (Anthropic’s recent work on effective context engineering is a good example).
This is valuable, but it doesn’t fully address the issue I keep running into: optimized reasoning is still reasoning in places where consistency matters more than inference. In many cases, the real mistake isn’t cost inefficiency so much as relying on inference once scope and context are already well-defined.
The Right Tool for the Right Task
To illustrate the distinction I’m making, let me share a simple example:
Imagine designing a reusable internal library. Early on, requirements are ambiguous. Different teams want slightly different things. API boundaries are unclear. Tradeoffs haven’t been resolved. At this stage, leveraging higher levels of inference is incredibly useful. You want help exploring the design space and pressure-testing assumptions.
Once the library stabilizes, the problem changes. The API is defined. Constraints are explicit. Multiple teams now depend on consistent behavior. Continued inference becomes a liability.
For adoption and ongoing use, constraint-driven models are often a better fit. Their literalism helps enforce consistency. Similar inputs produce similar outputs. Architectural intent is preserved rather than reinterpreted.
As you can see, thinking in terms of “better” or “worse” is the wrong approach - different models are better at different steps along the process.
Inference helps create the solution (0 → 1). Constraint-following behavior helps it scale (1 → N).
A Missing Layer: Explicit Context
Here’s the tricky part… constraint-driven models only work well if you give them the right constraints, and I believe we’re still missing a critical architectural layer: explicit context curation.
Now, let’s face it - context is rarely maintained well. Whether it lives in spreadsheets, design docs, third-party integrations, or just the implicit knowledge shared between a handful of people, it’s rarely consistent and is often contradictory.
As a result, AI tools are forced to infer this context from the limited information they can access in your codebase. As much as I’d love for this to be solved with a simple markdown file or two, that simply isn’t going to cut it.
Protocols like Model Context Protocol (MCP) help standardize how context is delivered to models, but they do not address the problem of determining which context is authoritative, how conflicts are resolved, or when inference must be actively discouraged.
In my view, the gap here is that we’ve yet to treat context synthesis as a deliberate step: actively assembling, weighting, and reconciling multiple sources of context into a task-specific worldview before the model ever starts executing.
Think of it as constructing a task-specific “truth environment” for the model before execution. You’re declaring what matters rather than asking the AI to guess:
Domain definitions and business invariants (what terms mean)
Architectural decisions and constraints (what’s allowed)
Feature specifications and acceptance criteria (what to build now)
For what it’s worth, GitHub Spec Kit is already great at this
Execution instructions for the current task (how to build it)
More than just documentation, it’s a call for structured, weighted, authoritative context that resolves conflicts deterministically.
As you can imagine, not all context is equal. Architectural decisions must outrank local preferences. Domain invariants must outrank convenience, and global knowledge should only be used when it doesn’t conflict with either. Without an explicit hierarchy of authority, “context” remains advisory, and advisory context still encourages the model to infer.
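To make that hierarchy concrete, here’s a minimal Python sketch of what weighted, deterministically resolved context might look like. The layer names and ranking are assumptions I’ve made for illustration, not an established standard:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ContextEntry:
        layer: str      # "domain_invariant", "architecture", "feature_spec", or "local_preference"
        key: str        # the term or decision this entry governs
        statement: str  # the authoritative wording

    # Higher rank wins. The ordering itself is an explicit, reviewable artifact.
    AUTHORITY_RANK = {
        "domain_invariant": 3,
        "architecture": 2,
        "feature_spec": 1,
        "local_preference": 0,
    }

    def resolve(entries: list[ContextEntry]) -> dict[str, ContextEntry]:
        """For each key, keep only the highest-authority entry - deterministically."""
        resolved: dict[str, ContextEntry] = {}
        # Sort ascending by rank so higher-authority entries overwrite lower ones.
        for entry in sorted(entries, key=lambda e: AUTHORITY_RANK[e.layer]):
            resolved[entry.key] = entry
        return resolved

The specific structure matters less than the fact that the ordering lives in a versioned artifact rather than in the model’s head.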
When you pair a constraint-driven model with explicitly curated context, I believe you’ll get:
More consistent output
Consistent architectural decisions across teams and time
Portability across projects and teams
Lower operational cost
Ultimately, the goal is to route execution through systematized knowledge rather than leaving the model to infer that knowledge each time.
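Continuing the sketch above, routing execution through that systematized knowledge could be as simple as assembling the resolved context into the task’s instructions before the constraint-driven model ever runs (run_model is a hypothetical stand-in for whatever client you use):

    def build_execution_prompt(task: str, entries: list[ContextEntry]) -> str:
        """Attach the resolved context as declared constraints, not optional advice."""
        resolved = resolve(entries)
        constraints = "\n".join(
            f"- [{e.layer}] {e.key}: {e.statement}" for e in resolved.values()
        )
        return (
            "Follow these constraints exactly; do not reinterpret or fill gaps:\n"
            f"{constraints}\n\nTask:\n{task}"
        )

    # prompt = build_execution_prompt("Add pagination to the orders endpoint", entries)
    # run_model(CONSTRAINT_DRIVEN_MODEL, prompt)  # run_model is a hypothetical client call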
What Comes Next
This leads to a natural question: how do you actually implement this? How does one structure and curate context at scale?
Right now, the industry still lacks a standard for this. We have plenty of tools for “prompt engineering” and retrieval, but we don’t yet have a discipline for context synthesis - the architectural layer that actively assembles, weighs, and enforces constraints before the model ever sees a prompt.
The path forward isn’t just smarter autonomous agents, but tighter constraints, explicit context, deliberate model selection, and humans carving the path for AI to follow (something I wrote about recently).
This is a problem space I’m invested in exploring in 2026.
While I don’t have all the answers yet, I am convinced one of the next great leaps in AI-accelerated software engineering will come from stricter context. We need to stop hoping the model “figures it out” and start building the environments where it won’t fail.


