<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[the spiral essays]]></title><description><![CDATA[exploring the philosophy and practice of modern web development - bridging creativity and systems thinking in the age of AI.]]></description><link>https://thespiralessays.ai</link><image><url>https://substackcdn.com/image/fetch/$s_!aynT!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc00078e0-522e-4a19-92cf-a1b58bb0c65c_1024x1024.png</url><title>the spiral essays</title><link>https://thespiralessays.ai</link></image><generator>Substack</generator><lastBuildDate>Wed, 29 Apr 2026 23:31:50 GMT</lastBuildDate><atom:link href="https://thespiralessays.ai/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Anthony Martinović]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[anthonymartinovic@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[anthonymartinovic@substack.com]]></itunes:email><itunes:name><![CDATA[Anthony Martinovic]]></itunes:name></itunes:owner><itunes:author><![CDATA[Anthony Martinovic]]></itunes:author><googleplay:owner><![CDATA[anthonymartinovic@substack.com]]></googleplay:owner><googleplay:email><![CDATA[anthonymartinovic@substack.com]]></googleplay:email><googleplay:author><![CDATA[Anthony Martinovic]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The Context Problem with Agentic AI]]></title><description><![CDATA[Why AI-Assisted Development Needs Curated Context at Scale]]></description><link>https://thespiralessays.ai/p/on-the-context-problem-with-agentic</link><guid 
isPermaLink="false">https://thespiralessays.ai/p/on-the-context-problem-with-agentic</guid><dc:creator><![CDATA[Anthony Martinovic]]></dc:creator><pubDate>Wed, 21 Jan 2026 00:32:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!3cfx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It&#8217;s 2026 now, and the software engineering industry is in the midst of a (not so) quiet AI revolution. The latest focus is <strong>Agentic AI</strong> - systems designed to assist, reason, plan, and execute on increasingly complex tasks with minimal human intervention. These systems often go beyond following instructions and make architectural decisions through deep reasoning. They&#8217;re being heralded as the future of development: AI that &#8220;just figures it out&#8221;, removing humans from the tedious parts of software creation.</p><p>I think this approach is undeniably powerful in certain circumstances, but there&#8217;s an issue with how broadly it&#8217;s being applied. Inference-heavy systems and higher levels of automation are increasingly being treated as a near-complete solution to software complexity, <strong>but they simply aren&#8217;t.</strong></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://thespiralessays.ai/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading the spiral essays! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>If leveraged too heavily in the wrong situations,<strong> agentic AI will create far more technical debt and organizational dysfunction than any collection of human engineers could create alone.</strong></p><p>This isn&#8217;t a critique of model capability. Agentic systems are powerful and most certainly have their place. The issue is which models are being used where, and how broadly inference-heavy approaches are being applied.</p><p>I believe there&#8217;s a more durable path forward - one that keeps the upside of AI assistance without taking on much of the risk of boundless autonomy. </p><p>To find a solution however, we first need to diagnose the problem.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3cfx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3cfx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 424w, https://substackcdn.com/image/fetch/$s_!3cfx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 848w, 
https://substackcdn.com/image/fetch/$s_!3cfx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 1272w, https://substackcdn.com/image/fetch/$s_!3cfx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3cfx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png" width="800" height="533" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:533,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:84702,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://thespiralessays.ai/i/185116459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3cfx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 424w, https://substackcdn.com/image/fetch/$s_!3cfx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 848w, 
https://substackcdn.com/image/fetch/$s_!3cfx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 1272w, https://substackcdn.com/image/fetch/$s_!3cfx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0452bb65-b35c-455b-9a3a-9f57dfd5a6e7_800x533.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><div><hr></div><h2><strong>The Allure of Agency</strong></h2><p>Agentic engineering is simple in concept: delegate multiple steps of
the development process to AI systems, reducing the need for continuous human intervention. Tools like <a href="https://claude.com/product/claude-code">Claude Code</a> (utilizing Claude&#8217;s family of models) exemplify this approach. They generate code, reason about architecture, infer intent, and propagate decisions across files and systems.</p><p>And in isolation? It&#8217;s phenomenal.</p><p>Think of it like having a senior engineer in the room: opinionated, experienced, willing to think outside the box and challenge your assumptions. When you&#8217;re prototyping, experimenting, or building something solo, this is exactly what you want. The AI narrates its reasoning beautifully, fills in gaps you didn&#8217;t know existed, and moves fast.</p><p><strong>For early-stage startups, this is often the correct choice.</strong> Speed matters more than formalism. Ambiguity is cheap. The same person who writes the context usually consumes the output. Assumption-heavy systems are incredibly valuable in this phase.</p><p>The problem emerges when this same approach scales beyond its natural boundaries. This &#8220;senior engineer&#8221; is making architectural decisions behind the scenes - decisions you didn&#8217;t explicitly authorize, decisions that aren&#8217;t necessarily documented in shared artifacts, decisions that will eventually multiply across teams and timescales.</p><p>What works brilliantly in a single-team, fast-feedback environment will likely become a liability when context must survive handoffs, turnover, and detailed inspection.</p><div><hr></div><h2><strong>The Hidden Cost of Inference</strong></h2><p>Imagine multiple teams working on loosely related parts of the same system. Each uses agentic tooling to make architectural and implementation decisions in real time. Locally, progress looks strong. 
Decisions may even be documented within team boundaries.</p><p>But here&#8217;s the problem: <strong>what&#8217;s missing is the reasoning path that led there.</strong></p><p>Much of what occurred and how each team arrived at its conclusions is never documented in shared artifacts, and even when it is, <strong>tools that rely heavily on deep reasoning still tend to draw different conclusions based on partial or evolving context.</strong></p><p>To clarify - when I say &#8220;context&#8221; here, I don&#8217;t just mean your code files or chat history. I mean the invisible constraints - the business domain knowledge not normally found in a codebase, the architectural non-negotiables, and the &#8220;why&#8221; behind previous decisions. When this information isn&#8217;t explicit, the model infers more, and different models tend to infer differently by default.</p><p>Ask yourself this litmus test:</p><p><strong>Can two teams, six months apart, with partial context, reach similar outcomes and explain why?</strong></p><p>I suspect assumption-heavy systems will fail this test quietly.</p><p>Here&#8217;s the dangerous part: <strong>the divergence won&#8217;t show up immediately</strong>. Codebases drift. Architectural patterns fragment. Tradeoff decisions vary across teams. Integration problems appear long after the decisions that caused them.</p><p><strong>This tech debt will be invisible, until it&#8217;s not.</strong></p><p>Every creative inference becomes an undocumented architectural decision. Every &#8220;reasoned&#8221; choice becomes a potential divergence point. Why? Because the reasoning happens behind the scenes (inside the model, not in your artifacts), so you can&#8217;t inspect it, reproduce it, or trust it across contexts.</p><p>There&#8217;s also the cost factor that businesses need to consider: <strong>deep reasoning is expensive</strong>. 
Take <a href="https://www.anthropic.com/claude/opus">Claude Opus</a>, for example - it prioritizes sophisticated inference, but consumes significantly more tokens per interaction than more constrained approaches. While highly valuable in the earlier stages of development, it will eat away at your bottom line when used beyond its intended scope.</p><p>To top it off, the total cost of ownership for agentic AI at scale goes far beyond the token bill: the hidden labor of reconciling divergent outputs can easily dwarf the API expenses.</p><div><hr></div><h2><strong>The Phase Change: From Prototyping to Production</strong></h2><p>Now, this isn&#8217;t a moral judgment about agentic AI. I&#8217;m simply looking at a phase change in how systems scale.</p><p><strong>Early stage:</strong></p><ul><li><p>Small teams</p></li><li><p>Shared mental models</p></li><li><p>Informal context</p></li><li><p>Fast feedback</p></li><li><p>Limited blast radius</p></li></ul><p><em>Assumptions help here - awesome &#128077;</em></p><p><strong>Organizational scale:</strong></p><ul><li><p>Many teams</p></li><li><p>Partial context</p></li><li><p>Artifact reuse across time and people</p></li><li><p>Asynchronous consumption</p></li><li><p>Accountability and governance</p></li></ul><p><em>Assumptions become a risk - no thanks &#128078;</em></p><p>Nothing about this transition implies weaker engineers or worse models - it&#8217;s the environment itself that&#8217;s different.</p><p>At the early stage, you want models to infer, challenge, and fill in gaps. 
At organizational scale, you&#8217;ll need models to respect boundaries, follow declared rules, and produce more consistent outputs.</p><p>Production failures almost always trace back to misaligned assumptions, not intelligence.</p><div><hr></div><h2><strong>What Production Systems Generally Need</strong></h2><p>Once requirements and constraints are clear, production systems tend to need something different: <strong>compliance over creativity</strong>. Predictability over possibility. Executional correctness over speculative reasoning.</p><p>Literalist, rule-following models quietly win here - provided the rules are actually clear.</p><p>Inference-heavy models tend to interpret ambiguity, challenge assumptions, and fill gaps. They behave like thoughtful collaborators when problems are underspecified.</p><p>Constraint-driven models tend to behave differently in day-to-day engineering use. They behave more like the lawyer of LLMs - rigid, precise, and far less willing to reinterpret intent once the rules are set. As of writing, I&#8217;m finding <a href="https://ai.google.dev/gemini-api/docs/models">Gemini Flash</a> to be the clearest example of this archetype, though I recognize that model behaviors are fluid and likely to shift over time.</p><p><strong>Both are valid design philosophies. </strong>The optimal solution depends entirely on the scenario in front of you.</p><p>For budget-conscious businesses (which is most), <strong>this matters</strong>. 
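</p><p>To make that compounding concrete, here&#8217;s a toy calculation - every price, volume, and ratio in it is invented purely for illustration, not real model pricing:</p>

```python
# Back-of-the-envelope only: all numbers below are invented to
# illustrate how per-call cost differences compound, not real pricing.
COST_PER_CALL = {
    "inference_heavy": 0.15,     # hypothetical $ per interaction
    "constraint_driven": 0.01,   # hypothetical $ per interaction
}

CALLS_PER_DAY = 1_000            # across the whole engineering team
WORKING_DAYS_PER_MONTH = 22

def monthly_spend(model: str) -> float:
    """Monthly AI spend if every interaction used this model."""
    return COST_PER_CALL[model] * CALLS_PER_DAY * WORKING_DAYS_PER_MONTH

heavy = monthly_spend("inference_heavy")    # 3300.0
lean = monthly_spend("constraint_driven")   # 220.0

# Routing, say, 80% of well-defined work to the cheaper model:
mixed = 0.2 * heavy + 0.8 * lean            # 836.0
print(f"all-heavy ${heavy:,.0f}/mo, all-lean ${lean:,.0f}/mo, mixed ${mixed:,.0f}/mo")
```

<p>Even with made-up numbers, the shape is the point: small per-call differences multiply across interactions and working days.</p><p>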
If you&#8217;re running hundreds or thousands of AI-assisted interactions per day across your engineering team, the cost difference between inference-heavy and constraint-driven approaches compounds quickly.</p><p><strong>Training engineers to toggle based on task type </strong>(expensive inference for ambiguity, cheaper execution for defined work)<strong> can dramatically reduce AI spend without sacrificing quality.</strong></p><p>It&#8217;s also worth noting that a lot of recent work on context engineering is moving in this direction. The focus seems to be on reducing token usage, managing agent state, and making deep reasoning cheaper and more efficient (Anthropic&#8217;s recent work on <a href="http://anthropic.com/engineering/effective-context-engineering-for-ai-agents">effective context engineering</a> is a good example).</p><p>This is valuable, but it doesn&#8217;t fully address the issue I keep running into: <strong>optimized reasoning is still reasoning in places where consistency matters more than inference</strong>. In many cases, the real mistake isn&#8217;t cost inefficiency so much as relying on inference once scope and context are already well-defined.</p><div><hr></div><h2><strong>The Right Tool for the Right Task</strong></h2><p>To illustrate this distinction I&#8217;m making, let me share a simple example:</p><p>Imagine designing a reusable internal library. Early on, requirements are ambiguous. Different teams want slightly different things. API boundaries are unclear. Tradeoffs haven&#8217;t been resolved. <strong>At this stage, leveraging higher levels of inference is incredibly useful</strong>. You want help exploring the design space and pressure-testing assumptions.</p><p>Once the library stabilizes, the problem changes. The API is defined. Constraints are explicit. Multiple teams now depend on consistent behavior. 
Continued inference becomes a liability.</p><p><strong>For adoption and ongoing use, constraint-driven models are often a better fit.</strong> Their literalism helps enforce consistency. Similar inputs produce similar outputs. Architectural intent is preserved rather than reinterpreted.</p><p>As you can see, thinking in terms of &#8220;better&#8221; or &#8220;worse&#8221; is the wrong approach - different models are better at different steps along the process.</p><p><strong>Inference helps create the solution </strong><em><strong>(0 &gt; 1)</strong></em><strong>. Constraint-following behavior helps it scale </strong><em><strong>(1 &gt; N)</strong></em><strong>.</strong></p><div><hr></div><h2><strong>A Missing Layer: Explicit Context</strong></h2><p>Here&#8217;s the tricky part&#8230; constraint-driven models only work well if you give them the right constraints, and I believe we&#8217;re still missing a critical architectural layer: <strong>explicit context curation</strong>.</p><p>Now, let&#8217;s face it - context is rarely maintained well. Whether it lives in spreadsheets, design docs, third-party integrations, or just implicit knowledge shared between a handful of people, it&#8217;s rarely consistent and is often contradictory.</p><p>As a result, AI tools are forced to infer this context from the limited information they have access to in your codebase. 
As much as I&#8217;d love for this to be solved with a simple markdown file or two, that simply isn&#8217;t going to cut it.</p><p>Protocols like <a href="https://modelcontextprotocol.io/docs/getting-started/intro">Model Context Protocol (MCP)</a> help standardize how context is delivered to models, but they do not address the problem of determining which context is authoritative, how conflicts are resolved, or when inference must be actively discouraged.</p><p>In my view, the gap here is that we&#8217;ve yet to treat context synthesis as a deliberate step: <strong>actively assembling, weighting, and reconciling multiple sources of context into a task-specific worldview </strong><em><strong>before</strong></em><strong> the model ever starts executing.</strong></p><p>Think of it as constructing a task-specific &#8220;truth environment&#8221; for the model before execution. You&#8217;re declaring what matters rather than asking the AI to guess:</p><ul><li><p>Domain definitions and business invariants (what terms mean)</p></li><li><p>Architectural decisions and constraints (what&#8217;s allowed)</p></li><li><p>Feature specifications and acceptance criteria (what to build now)</p><ul><li><p>For what it&#8217;s worth, <a href="https://github.com/github/spec-kit">GitHub Spec Kit</a> is already great at this</p></li></ul></li><li><p>Execution instructions for the current task (how to build it)</p></li></ul><p>More than just documentation, it&#8217;s a call for structured, weighted, authoritative context that resolves conflicts deterministically.</p><p>As you can imagine, not all context is equal. Architectural decisions must outrank local preferences. Domain invariants must outrank convenience, and global knowledge should be used only when it doesn&#8217;t conflict. 
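</p><p>One way to picture that deterministic resolution is a sketch like the following - the layer names, ranking, and API are my own hypothetical illustration, not an existing tool or standard:</p>

```python
from dataclasses import dataclass

# Hypothetical authority ranking: higher number wins. The layers mirror
# the hierarchy described above (invariants > architecture > preferences).
AUTHORITY = {
    "domain_invariant": 3,
    "architectural_decision": 2,
    "local_preference": 1,
    "global_knowledge": 0,   # used only when nothing else speaks
}

@dataclass(frozen=True)
class Fact:
    topic: str   # what the fact is about, e.g. "state_management"
    claim: str   # the assertion, e.g. "use Zustand"
    layer: str   # which context source asserted it

def synthesize(facts: list[Fact]) -> dict[str, str]:
    """For each topic, keep the claim from the most authoritative layer,
    so conflicts always resolve the same way for the same inputs."""
    chosen: dict[str, Fact] = {}
    for fact in facts:
        current = chosen.get(fact.topic)
        if current is None or AUTHORITY[fact.layer] > AUTHORITY[current.layer]:
            chosen[fact.topic] = fact
    return {topic: fact.claim for topic, fact in chosen.items()}

worldview = synthesize([
    Fact("state_management", "use Redux", "local_preference"),
    Fact("state_management", "use Zustand", "architectural_decision"),
    Fact("money", "amounts are integer cents", "domain_invariant"),
])
print(worldview["state_management"])  # the architectural decision wins
```

<p>The useful property is that the same facts always reconcile to the same worldview - nothing is left for the model to re-infer per session.</p><p>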
Without an explicit hierarchy of authority, &#8220;context&#8221; remains advisory, and advisory context still encourages the model to infer.</p><p>When you pair a constraint-driven model with explicitly curated context, I believe you&#8217;ll get:</p><ul><li><p>More consistent output</p></li><li><p>Consistent architectural decisions across teams and time</p></li><li><p>Portability across projects and teams</p></li><li><p>Lower operational cost</p></li></ul><p>Ultimately, the goal is to route execution through systematized knowledge rather than leaving the model to infer that knowledge each time.</p><div><hr></div><h2><strong>What Comes Next</strong></h2><p>This leads to a natural question: how do you actually implement this? How does one structure and curate context at scale?</p><p>Right now, the industry still lacks a standard for this. We have plenty of tools for &#8220;prompt engineering&#8221; and retrieval, but we don&#8217;t yet have a discipline for context synthesis - the architectural layer that actively assembles, weighs, and enforces constraints before the model ever sees a prompt.</p><p>The path forward isn&#8217;t just smarter autonomous agents, but tighter constraints, explicit context, deliberate model selection, and humans carving the path for AI to follow (<a href="https://thespiralessays.ai/p/architecture-specification-execution">something I wrote about recently</a>).</p><p>This is a problem space I&#8217;m invested in exploring in 2026.</p><p>While I don&#8217;t have all the answers yet, I am convinced one of the next great leaps in AI-accelerated software engineering will come from stricter context. 
We need to stop hoping the model &#8220;figures it out&#8221; and start building the environments where it won&#8217;t fail.</p>]]></content:encoded></item><item><title><![CDATA[Architecture, Specification, Execution: A Paradigm for AI-Accelerated Development]]></title><description><![CDATA[Carve the path for AI to follow, don't walk it yourself]]></description><link>https://thespiralessays.ai/p/architecture-specification-execution</link><guid isPermaLink="false">https://thespiralessays.ai/p/architecture-specification-execution</guid><dc:creator><![CDATA[Anthony Martinovic]]></dc:creator><pubDate>Mon, 10 Nov 2025 01:15:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f7656f37-4ed7-493d-b819-add89a0addd2_640x457.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>AI tools have been around for a minute now, and while &#8220;vibe coding&#8221; seemed fun for a moment, it&#8217;s becoming clear (at least to me) that it hasn&#8217;t quite lived up to the hype.</p><p>Want to get &#8220;something&#8221; created quickly? Sure, vibe away. 
But let&#8217;s be real - most of us build specific products with ever-changing requirements, and there comes a point where &#8220;vibing&#8221; gets you nowhere fast. Instead, what we need is a stable foundation that maintains velocity and cuts through the noise.</p><p>You might question if AI is living up to the hype. Honestly? I think it is - but not through vibe coding. I&#8217;ve spent enough time tinkering with these tools to know they&#8217;re powerful, but we engineers need a paradigm to actually harness that power and condense that week&#8217;s worth of work into a day (like we were promised).</p><p>In this post, I&#8217;m going to share a paradigm that&#8217;s been working for me. To be clear: I&#8217;m not advocating for any particular products - Copilot, Kiro, Cursor, they&#8217;re all amazing. What I&#8217;m offering is an approach that works regardless of which tools you choose, delivering the compounding returns vibe coding never reaches.</p><p>Here&#8217;s the core principle:<strong> carve the path for AI to follow, don&#8217;t walk it yourself.</strong></p><p>Your job as the engineer is to set direction, establish constraints, and define success. AI&#8217;s job is to execute within those boundaries. Mix these roles and you&#8217;ll just muddy the waters.</p><p>This paradigm builds on spec-driven development, and it consists of three pillars:</p><ul><li><p><strong>Architecture </strong>- Document the decisions that shape your system</p></li><li><p><strong>Specification </strong>- Define the features within those constraints</p></li><li><p><strong>Execution</strong> - Prompt and let it run</p></li></ul><p>Here&#8217;s what makes this work: <strong>it&#8217;s a recursive system that feeds itself</strong>. Your execution prompts reference your specifications, which reference your architectural decisions. Each layer builds on the previous one. 
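</p><p>In code terms, the layering might look as simple as this sketch - the file layout and section headings are assumptions on my part, not tied to any particular product:</p>

```python
from pathlib import Path

def build_prompt(adr_dir: Path, spec_file: Path, task: str) -> str:
    """Assemble one execution prompt from the three pillars, in order:
    architecture (the container) -> specification (the feature) -> task."""
    # ADRs are assumed to live as markdown files near the workspace root.
    adrs = "\n\n".join(p.read_text() for p in sorted(adr_dir.glob("*.md")))
    spec = spec_file.read_text()
    return (
        "## Architectural decisions (binding)\n\n" + adrs
        + "\n\n## Feature specification\n\n" + spec
        + "\n\n## Execute\n\n" + task
    )
```

<p>The helper itself is trivial - the point is that execution prompts are assembled from durable artifacts rather than typed fresh each time, which is what makes the cycle recursive.</p><p>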
Architecture sets the container, specifications define what to build inside it, and Execution is where AI does the heavy lifting - all feeding back into a cycle that gets sharper each time.</p><p>The result? An optimal environment that maximizes code shipped to production.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i5KD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i5KD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 424w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 848w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 1272w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i5KD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png" width="1456" height="853" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:853,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:946854,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://thespiralessays.ai/i/178432534?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!i5KD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 424w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 848w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 1272w, https://substackcdn.com/image/fetch/$s_!i5KD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87650689-8ef3-4d33-b52b-233cbdc725cf_7731x4527.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Ok, let&#8217;s dig in.</p><h3>Architecture: Document the decisions that shape your system</h3><p>Before planning features or generating code, answer this: <strong>What are the foundational principles that define this project?</strong> We need guardrails - a series of governing ideas that create a well-defined container around everything we build. The key asset here is Architecture Decision Records (ADRs) - documented decisions about your technical approach that serve as this foundation.</p><p><strong>What belongs here?</strong></p><p>Tech stack choices, authentication approach, API design philosophy, state management patterns, testing strategy - every architectural decision that will guide future work. 
Think of this as your &#8220;source of truth&#8221; - the ultimate reference point for any decision made in this environment.</p><p>Example: &#8220;We&#8217;re using Zustand for state management because our app needs global state without the boilerplate of Redux, and Zustand&#8217;s simplicity aligns with our preference for minimal abstractions.&#8221;</p><p><strong>Why does this matter?</strong></p><p>AI needs context, and it needs the <em>right</em> context. AI has access to every pattern and approach imaginable. Every time you prompt it, it could consider countless solutions - that&#8217;s far too much. The goal is to narrow that window at the start, forcing AI to work within pre-defined channels.</p><p>Having architectural decisions on record makes this clear and binding. Once it has a constitution to follow, you prevent &#8220;drift&#8221;, where each AI-generated feature uses a different pattern.</p><p><strong>How to create them?</strong></p><p>Communicate.</p><p>Not just with AI chatbots, but with peers and stakeholders. Nail down requirements and limitations first. The more information you can distill, the better. Compare approaches, discuss trade-offs, and determine what makes sense for your situation.</p><p>As you progress, confirm your decisions, document the rationale, and keep it clear. Remember: these form the outer bounds of the context window for every prompt when generating code, so precision matters.</p><p>Store these documents at or near the root level of your workspace. You can revise them later, but take care when modifying existing decisions.</p><p><strong>The key takeaway</strong></p><p>Time invested here pays dividends during feature development. 
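</p><p>To ground the example, here&#8217;s roughly what that decision buys you - a store created in one call, with actions living beside the state they change. The sketch below is a hypothetical, minimal reimplementation of the Zustand-style pattern for illustration only; it is not the real <code>zustand</code> API.</p><pre><code class="language-typescript">// Illustrative only: a tiny reimplementation of the Zustand-style store
// pattern referenced in the example ADR. The real library is `zustand`;
// nothing here is its actual API surface.
type Listener = () => void;
type Updater = object | ((state: any) => object);

function create(init: (set: (u: Updater) => void) => object) {
  let state: object = {};
  const listeners: Listener[] = [];

  const setState = (u: Updater) => {
    // Accept either a partial object or a function of the current state,
    // then shallow-merge and notify subscribers.
    const next = typeof u === "function" ? u(state) : u;
    state = { ...state, ...next };
    listeners.forEach((fn) => fn());
  };

  state = init(setState);

  return {
    getState: () => state,
    setState,
    subscribe: (fn: Listener) => {
      listeners.push(fn);
      return () => {
        // Unsubscribe by removing the listener.
        const i = listeners.indexOf(fn);
        if (i !== -1) listeners.splice(i, 1);
      };
    },
  };
}

// Global cart state: no reducers, no providers, actions live with the data.
const cart = create((set) => ({
  items: [] as string[],
  add: (item: string) => set((s) => ({ items: [...s.items, item] })),
}));</code></pre><p>Because the ADR records the rationale (&#8220;global state without the boilerplate of Redux&#8221;), any AI-generated feature that touches state can be checked against that one sentence instead of re-opening the debate.</p><p>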
<strong>AI can&#8217;t make these calls, but you can.</strong> This is where engineers are most valuable - communication, judgment, trade-offs, absorbing business context.</p><p><strong>Everything outside these boundaries is noise that pollutes AI&#8217;s contextual awareness and leads to suboptimal results.</strong></p><p>Once you have the container in place, it&#8217;s time for the next phase: specifications.</p><h3>Specification: Define the features within those constraints</h3><p>This is where we detail <em>what</em> we&#8217;re actually building. Think of specs as feature-level plans that operate within the architectural boundaries you&#8217;ve set. This is Spec-Driven Development at its core, and it&#8217;s the meat of this paradigm (you&#8217;ll spend most of your development time here). Tools like GitHub&#8217;s Spec-Kit are perfect for this.</p><p>Here&#8217;s the critical relationship: ADRs set the rules, specs define the moves. Your architectural decisions have already narrowed AI&#8217;s solution space. Now, specs guide it toward the exact outcome you want. Without ADRs, AI might solve your problem five different ways across five features. With ADRs <em>and</em> specs working together, AI solves it consistently, following your established patterns every time.</p><p><strong>What belongs here?</strong></p><p>Anything relevant to your feature&#8217;s context: user stories, API contracts, UI/UX flows, edge cases, business logic, dependencies, success metrics - if it helps define what you&#8217;re building, include it.</p><p>Be as precise as possible. This might feel tedious at first, but remember: the goal isn&#8217;t instant code - <strong>it&#8217;s an environment that consistently generates code matching your intent.</strong> Every detail you provide here saves debugging time later.</p><p><strong>Why does this matter? </strong></p><p>Specs give AI a clear target within known boundaries. 
The recursive system works like this: AI reads your ADRs to understand <em>how</em> to build, then reads your specs to understand <em>what</em> to build. The tighter the connection between these two layers, the better your results.</p><p>This is how you prevent drift at the feature level. ADRs prevent architectural drift (every feature uses the same patterns), while specs prevent implementation drift (every feature solves problems the way you intended). Together, they create consistency that compounds.</p><p><strong>How to create them?</strong></p><p>Write specs as if you&#8217;re briefing another engineer who&#8217;s already read your ADRs. Reference specific architectural decisions when relevant - &#8220;Following our ADR on state management, this feature uses Zustand for global cart state.&#8221;</p><p>Be specific about the end goal. Don&#8217;t say &#8220;build a user profile page&#8221; - describe what data it displays, how it handles loading states, what happens when data is missing, how errors surface to users. Define validation rules, edge cases, and data shapes.</p><p>If you&#8217;re integrating with external services, make data contracts explicit. Design mockups, API documentation, and screenshots from similar features all help AI understand your intent.</p><p>Keep specs in version control alongside your code. Unlike ADRs (which rarely change), specs will evolve as requirements shift.</p><p>Focus on <em>what</em> to build, not <em>how</em> to build it. That&#8217;s AI&#8217;s job. Define the destination clearly, and let AI figure out the route within your architectural constraints.</p><p><strong>The key takeaway</strong></p><p>If Architecture is your foundation, specs are your blueprint. This is where you translate business needs into instructions AI can execute. 
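</p><p>For instance, &#8220;make data contracts explicit&#8221; can mean shipping actual type definitions with the spec. Here is a hypothetical sketch for the user profile page mentioned above - every name in it is illustrative, not from any real API:</p><pre><code class="language-typescript">// Hypothetical data contract for the user profile page example.
// A spec that pins these shapes down leaves AI no room to improvise them.
interface UserProfile {
  id: string;
  displayName: string;
  avatarUrl: string | null; // spec: render a placeholder avatar when null
  joinedAt: string;         // ISO 8601; spec: display as "Member since YYYY"
}

// The spec also enumerates every view state, so generated components
// must handle loading and failure, not just the happy path.
type ProfileViewState =
  | { status: "loading" }
  | { status: "error"; message: string }
  | { status: "ready"; profile: UserProfile };

function describeState(s: ProfileViewState): string {
  switch (s.status) {
    case "loading":
      return "Loading profile...";
    case "error":
      return "Could not load profile: " + s.message;
    case "ready":
      return "Member since " + s.profile.joinedAt.slice(0, 4);
  }
}</code></pre><p>A contract like this turns &#8220;how errors surface to users&#8221; from a judgment call into a type the generated code has to satisfy.</p><p>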
The clearer your specs, the less time you spend fixing AI&#8217;s output.</p><p>The magic happens in the interaction: <strong>ADRs tell AI what patterns to follow, specs tell AI what problems to solve.</strong> When both layers are clear, AI can move fast without breaking things.</p><p>You have the container, and now you have the plan. Time to execute.</p><h3>Execution: Prompt and let it run</h3><p>This is where your preparation pays off. With ADRs defining your architectural boundaries and specs detailing what to build, AI can now handle the implementation. This is where the velocity gains actually happen - you&#8217;re no longer writing boilerplate, wiring up APIs, or building CRUD operations. AI does the heavy lifting while you validate and guide.</p><p>Tools like Cursor and Kiro excel here, giving you AI assistance directly in your development environment.</p><p><strong>What belongs here?</strong></p><p>Your prompts should reference the specific spec you&#8217;re implementing, along with any relevant ADRs. The key is giving AI exactly what it needs - no more, no less.</p><p>A good prompt looks like this: &#8220;Implement the user dashboard feature from <code>/docs/specs/dashboard-spec.md</code>. Follow our state management patterns defined in <code>/docs/adr/004-zustand-state.md</code>. The component should fetch data from the <code>/api/dashboard</code> endpoint as specified.&#8221;</p><p>Notice what this does: it points to the spec (what to build), references the ADR (how to build it), and includes relevant context (the API endpoint). AI now has clear constraints and a clear target.</p><p><strong>Why does this matter?</strong></p><p>This is where the recursive system completes the loop. AI has your architectural rules (ADRs), knows what to build (specs), and now executes within those boundaries. 
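</p><p>The file paths in a prompt like the one above presuppose a predictable workspace layout. One possible arrangement - only <code>/docs/specs</code> and <code>/docs/adr</code> come from the examples here; the rest is illustrative:</p><pre><code>docs/
  adr/
    001-tech-stack.md      # rarely changes once accepted
    004-zustand-state.md   # records the "why Zustand over Redux" decision
  specs/
    dashboard-spec.md      # feature-level plan; evolves as requirements shift
src/
  ...</code></pre><p>Whatever the layout, keep it stable: prompts reference these paths verbatim, so moving a document silently breaks the references your prompts rely on.</p><p>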
The better your first two layers, the less intervention you need here.</p><p>When your ADRs and specs are solid, AI generates code that&#8217;s not only functional but <em>consistent</em> with the rest of your codebase. You&#8217;re not debugging random implementation choices - you&#8217;re validating that the output matches your intent.</p><p><strong>How to execute effectively?</strong></p><p>Start by pointing AI to the relevant spec and any ADRs that apply. Be explicit about file paths and references - AI can&#8217;t guess which documents matter.</p><p><strong>Let AI complete full implementations before intervening</strong>. This is critical: if you start manually editing AI&#8217;s output mid-stream, you break its context and end up with a franken-implementation that&#8217;s half-yours, half-AI&#8217;s. Resist the urge.</p><p>Validate the output against your spec. Does it handle all the edge cases you defined? Does it follow your architectural patterns? Does it match the data contracts you specified?</p><p>If something&#8217;s wrong, <em>don&#8217;t fix the code directly</em>. Instead:</p><ol><li><p>Check if your spec was clear enough</p></li><li><p>Refine your prompt with more specific guidance</p></li><li><p>Re-run the generation</p></li></ol><p>This might feel slower initially, but it teaches you how to guide AI better and keeps the codebase consistent.</p><p>Keep your context window focused. Don&#8217;t dump your entire codebase into the prompt - reference only what&#8217;s needed for this specific feature. Clear context frequently and re-establish boundaries with each new task.</p><p>As you iterate, you&#8217;ll develop intuition for what level of detail AI needs. Some features need explicit examples, others just need the spec. This is a skill that improves with practice.</p><p><strong>The key takeaway </strong></p><p>You&#8217;re no longer the coder - you&#8217;re the orchestrator. Your job is to validate and guide, not implement. 
If AI generates something wrong, it&#8217;s usually because the spec was unclear or the prompt lacked necessary context.</p><p>The more you stay in &#8220;prompt mode&#8221; rather than &#8220;edit mode,&#8221; the faster you&#8217;ll move. <strong>Trust the system you&#8217;ve built</strong>: solid ADRs + detailed specs + clear prompts = consistent, high-quality output.</p><p>Now you have all three layers working together. Time to ship.</p><h3>Closing Thoughts</h3><p>The shift happening in software development isn&#8217;t about AI replacing engineers - it&#8217;s about engineers moving upstream. Your value lies in system design, communication, and judgment calls, not in typing out implementations.</p><p>This paradigm isn&#8217;t just about speed. It&#8217;s about building better, more consistent software while maintaining the clarity and intention that separates great code from functional code. When you set clear boundaries and detailed plans, AI becomes a force multiplier.</p><p>Vibe coding will always have its place for prototypes and throwaway experiments. But when you&#8217;re building something real, something that needs to scale and evolve and withstand changing requirements? You need structure. You need intention. <strong>You need a system that feeds itself.</strong></p><p>Start with your ADRs. Define your boundaries. Write specs that leave no ambiguity. Let AI execute within the container you&#8217;ve built. 
Trust the process, resist the urge to intervene, and watch what compounds.</p><p>The paradigm is simple: <strong>Architecture, Specification, Execution.</strong> The principle is simpler: <strong>Carve the path for AI to follow, don&#8217;t walk it yourself.</strong></p>]]></content:encoded></item></channel></rss>