Sumit Sute
Apr 15, 2026
12 min read
Tools vs Agents: What Gets Lost When You Stop Reaching
Static analysis is O(V+E). Agentic workflows are O(I-have-no-idea). What happened when I installed Knip, handed it to an agent, and stopped thinking about input sets.
#ai
#debugging
#nextjs
#architecture
#reflections

"Has anyone used knip?"

Bijoy posted that question in our internal Slack. The honest truth: I'd been manually refactoring and hauling dead code out of projects for months. Agentic workflows made the codebase grow faster than I could prune it; heavy sessions left behind accumulated scaffolding and leaky, half-finished processes, and the forgotten trash polluted my projects. I was basically doing archaeology on my own codebase, digging up layers of dead code faster than new code could bury them.

In the pre-agentic era, a tool like Knip would have been an obvious reach. I would have Googled "find unused javascript files" and landed on it in seconds. But something had shifted. When you have an AI agent that can read your entire codebase and reason about it, you stop reaching for specialized tools. You ask the agent to do it. The agent tries. It's slower, more expensive, less accurate, but the activation energy is zero. So you keep doing it. It’s like asking a brilliant philosopher to balance your accounting ledgers because they’re already sitting in your office.

Bijoy's message snapped me out of that loop. I installed Knip. Then, because I apparently haven't suffered enough, I handed it to an AI agent to orchestrate. That's where things got interesting.


What Knip Actually Does (The Accountant)

The core mechanism: Knip parses every source file into an Abstract Syntax Tree (AST). Think of an AST as a structural blueprint that turns raw text into a data tree tools can query. To generate these blueprints, Knip relies on oxc-parser, an incredibly fast JavaScript toolchain written in Rust. From each tree, it pulls import specifiers and export declarations. Because Knip doesn't need to do slow semantic type-checking like the traditional TypeScript compiler, the oxc-parser allows it to blast through files in a single pass. It starts from entry points and walks the import graph outward. Everything reachable is "used." Everything in your project glob that isn't reachable gets flagged.

The math is a set difference: unused = files matched by the project glob, minus files reachable from the entry points.

This is O(V + E) where V is source files and E is import edges. For 100 files and ~300 imports, that's 400 operations. The kind of thing that finishes before your coffee has cooled. Literally sub-second. The runtime grows linearly with the number of files plus the number of imports between them. Double the files, roughly double the time. This is the computational equivalent of "it just works." Compare this to O(V²), where doubling the files quadruples the time — the computational equivalent of "it worked fine until it didn't."
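The walk itself can be sketched in a few lines. These are hypothetical types and names for illustration, not Knip's internals:

```typescript
// Reachability over the import graph, then the set difference.
type ImportGraph = Map<string, string[]>; // file -> files it imports

function findUnused(
  graph: ImportGraph,
  entries: string[],
  projectFiles: string[],
): Set<string> {
  const reachable = new Set<string>();
  const stack = [...entries];
  while (stack.length > 0) {
    const file = stack.pop()!;
    if (reachable.has(file)) continue;
    reachable.add(file); // each file visited at most once: the V in O(V + E)
    for (const dep of graph.get(file) ?? []) stack.push(dep); // each edge once: the E
  }
  // unused = project glob minus everything reachable from entry points
  return new Set(projectFiles.filter((f) => !reachable.has(f)));
}
```

Each file and each edge gets touched once, which is where the linear bound comes from.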

A caveat, because honesty matters: Big O is a useful shorthand but a misleading one. Bob Sedgewick — Knuth's PhD student, and the man who literally wrote the book on algorithm analysis — argues that O-notation is "widely misused to predict and compare algorithm performance" because the hidden constants make experimental verification impossible. You can't test a hypothesis with a hidden constant. Richard Lipton coined the term "galactic algorithm" for algorithms with wonderful asymptotic behavior that are never used in practice — Strassen's matrix multiplication is theoretically superior to the naive approach but only marginally faster even at matrix sizes of 1,000, because the constant overhead eats the asymptotic advantage. And Peter Norvig's famous latency table shows that L1 cache access takes 0.5ns while main memory takes 100ns — a 200x gap that Big O's uniform-cost model completely hides. Cache locality routinely beats asymptotic superiority in real systems.

So why cite Big O at all? Because for this specific problem — walking an import graph of 100 files — the asymptotic class does predict reality. The input is small enough that constant factors don't dominate, the data fits in cache, and the linear-time claim matches wall-clock time. Big O becomes misleading at scale or when comparing algorithms with different constant factors. For "scan 100 files and subtract two sets," it's fine. The accountant doesn't need a supercomputer.

The dependency check is the same trick. Knip resolves each import to a node_modules package, then cross-references against package.json. Listed but never imported? Flagged. Same determinism. The tool doesn't have opinions about your code. It has a ledger.
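The ledger in sketch form. `importedPackages` stands in for the package specifiers Knip collected while walking the graph; the shape here is illustrative, not Knip's API:

```typescript
// Packages listed in package.json but never imported anywhere.
function unusedDependencies(
  pkg: { dependencies?: Record<string, string> },
  importedPackages: Set<string>,
): string[] {
  return Object.keys(pkg.dependencies ?? {}).filter(
    (name) => !importedPackages.has(name),
  );
}
```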

The plugin system is where it gets interesting. Knip has 146 plugins for frameworks and tools — Next.js, Vite, Vitest, ESLint, Storybook, Playwright, and on and on. Plugins are auto-enabled based on package.json: if next is in your dependencies, the Next.js plugin activates. No configuration required.

Each plugin contributes two things: config files (dynamically parsed to find referenced dependencies) and entry files (added to the module graph as starting points). The Next.js plugin, for example, automatically registers 15+ entry patterns: pages, app-router files like layouts and route handlers, middleware, and the framework config itself.

The plugin also parses next.config.ts to find dependencies referenced in configuration, and walks into node_modules/ to trace what's actually consumed from each package. These are good design choices that make Knip thorough — it understands the framework's conventions so you don't have to. They also, as I would discover, make Knip enthusiastic about parsing things you never asked it to parse.

The metaphor: Knip is an accountant. It builds a spreadsheet and subtracts column B from column A. The answer is the answer. No vibes, no "I think this might be unused," no probabilistic hand-waving. Run it twice, get the same result. Ask it if it's sure, and it gives you the same answer again, slightly louder.

How the Accountant Actually Thinks

It's worth unpacking what "deterministic" means here, because the word gets thrown around like it's obvious. It isn't.

Knip's pipeline is a directed acyclic graph of transformations, each one pure and deterministic — run it once, get a result. Run it twice, get the same result. Run it a thousand times, get the same result. There's no state that accumulates, no hidden variables that shift between runs, no stochastic sampling that might produce a different answer on Tuesday. This is the guarantee that engineers reach for when they trust a tool: the output converges to a fixed point regardless of input order or repetition.

The pipeline: discover files from the project glob, parse each into an AST, extract imports and exports, resolve module specifiers, walk the graph from the entry points, then diff the reachable set against the glob and report.

Each stage is composable — you can swap the parser, change the resolver, add a new plugin — and the pipeline still produces a valid answer because each stage's output is a well-defined function of its input. This is referential transparency, borrowed from functional programming: the same input always produces the same output, with no side effects.

Most production agents — LangChain, Claude Code, OpenAI's Agents SDK, LlamaIndex workflows — operate by tool calling: the model decides what to invoke, a deterministic function executes it, and the result is returned to context. The agent in this very article called Knip via bash. It didn't try to probabilistically simulate import graph traversal. This means the "agent vs tool" framing in most discussions is slightly false. They're not rivals — they're a stack.

That said, an LLM trying to replace a tool — performing the tool's job by reading files pairwise and guessing at reachability — hits a wall. The model can orchestrate the tool (deciding when to call it, with what inputs), but it can't simulate the tool's determinism. When you ask an LLM "is this file unused?", it doesn't parse the code and compute a set difference. Instead, it uses its context window to simulate traversal probabilistically. Make the codebase large enough, and that simulation will hallucinate or drop context.
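The stack in miniature: the model proposes a call, and a deterministic function executes it and returns the result to context. These are hypothetical shapes, not any SDK's actual API:

```typescript
type ToolCall = { name: string; args: string[] };

const tools: Record<string, (args: string[]) => string> = {
  // Stub for the bash tool the agent used to invoke Knip.
  bash: (args) => `$ ${args.join(" ")}`,
};

function executeToolCall(call: ToolCall): string {
  const tool = tools[call.name];
  if (!tool) throw new Error(`unknown tool: ${call.name}`);
  return tool(call.args); // deterministic: same call, same result, every time
}
```

The model's only job here is choosing `name` and `args`; everything after that boundary is a fixed function.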


The Run, and the Crash

This is where the agent's lack of an operational cost-model became visible.

The setup. I already had .next/ (381MB) and node_modules/ (594MB) sitting in my project from prior work. They were just there, like that one kitchen drawer everyone has that's full of takeout menus and dead batteries.

The agent's configuration. It created a knip.jsonc with project: ["src/**/*.{ts,tsx}"]. Correct — the glob scopes analysis to source files only. What the agent missed: Knip's Next.js plugin expands scope beyond the glob. It parses next.config.ts for referenced dependencies and walks into node_modules/ to trace package consumption. So despite the reasonable glob, Knip was parsing on the order of 15,000 files across .next/ and node_modules/ to answer a question about 100 source files. That's a signal-to-noise ratio of roughly 1:150.

The crash. npx knip died with RangeError: Array buffer allocation failed. The oxc-parser uses a flat ArrayBuffer for AST transfer between Rust and JavaScript. The buffer scales with file content. After parsing thousands of node_modules files, the accumulated pressure made it choke on a 38KB data file with extremely long lines — like eating through a buffet, then dying on a single olive.

The agent's diagnosis: it investigated file sizes, found the two data-heavy files, and patched oxc-parser to disable experimentalRawTransfer. The reports finally ran.

But it was treating the symptom (crash) rather than the cause (15,000 irrelevant files in the input set). I deleted .next/ and node_modules/ manually to make Knip run cleanly. The agent never modeled the computational cost. It's like ordering an Uber to go next door. Correct destination. Absurd journey.

Did removing them affect output quality? No. The .next/types/ directory contains generated types derived from source files — parsing them adds no new signal. The node_modules/ dependency check resolves imports against package.json — it checks package names, not file contents. Parsing 15,000 files to verify eslint is in package.json is like reading every book in the library to check the card catalog. The catalog is right there.

The crash meant the output quality was zero. After removal, Knip ran in seconds and found real unused code. The "good design choices" in the plugin system — parsing next.config.ts for referenced dependencies, walking node_modules/ for package consumption — are what caused the crash. Thoroughness without an escape hatch isn't thoroughness. It's a denial-of-service attack on your own RAM.
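The shape of the fix, as configuration rather than manual deletion. This is illustrative: `project` and `ignore` are real Knip options, but check the docs before trusting the exact globs.

```jsonc
// knip.jsonc — keep the analysis scoped to source, and say so explicitly.
{
  "project": ["src/**/*.{ts,tsx}"],
  // Build output adds no signal; its generated types derive from src anyway.
  "ignore": [".next/**"]
}
```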

The agent treated Knip as a black box. It configured the input correctly (the project glob) and ran the command. When it crashed, it debugged the symptom (the crash) rather than the cause (the input set was 150x larger than necessary).

You don't think about this explicitly. But you carry an implicit mental model: "this tool parses files. I have 15,000 files in node_modules. The tool doesn't need to parse those. I should make sure it doesn't." That model comes from experience — from watching linters eat your RAM, from tsc taking twenty minutes because someone added **/*.ts to the include instead of src/**/*.ts. From that one time you ran ESLint on node_modules and your laptop sounded like it was preparing for liftoff.

The agent didn't have this operational model. Not because it can't form one, but because the context it was working in — the Knip plugin docs and the knip.jsonc schema — didn't surface the right information at the right time. The Knip docs do have a performance guide that recommends excluding files from analysis and an --debug flag to diagnose bottlenecks. The information was there. The agent just didn't connect it to the problem at hand.


The False Positives and the Revert

Three beats of the agent doing the right thing at the wrong time, in the wrong order, for the wrong reasons.

Beat 1: The eslint removal.

Knip found 13 unused dependencies. The agent removed them all with the confidence of a surgeon who hasn't checked the X-ray. But three of them, eslint, eslint-config-next, and @eslint/eslintrc, aren't actually unused. They're referenced indirectly: "lint": "next lint" in package.json scripts depends on them through the Next.js framework. Knip's dependency tracer couldn't follow that indirection. The agent didn't catch it proactively. It caught it reactively when npm run lint failed, which is the software equivalent of locking your keys in the car and only noticing when you try to drive to work.

This is a context problem. The agent had the package.json in its context. It had the Knip output in its context. It didn't connect the two because the connection requires understanding that next lint transitively depends on eslint. That's framework knowledge, not file-reading knowledge. The agent can read files. It can't read the room.
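One way to encode that framework knowledge once you've learned it the hard way. `ignoreDependencies` is a real Knip option; treat the specific package list as particular to this project.

```jsonc
// knip.jsonc — dependencies Knip can't trace because `next lint`
// reaches them transitively through the framework.
{
  "ignoreDependencies": ["eslint", "eslint-config-next", "@eslint/eslintrc"]
}
```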

Beat 2: The export auto-fix.

The agent ran npx knip --fix --fix-type exports. Knip removed export default from menuConfig.ts, which broke the barrel re-export in header/index.ts (export * from './DropdownMenu/menuConfig'). Build failed. Full git checkout -- . revert. At this point I wasn't debugging. I was bargaining.

The agent's recovery: mark exports as "warn" in the rules config, skip the auto-fix entirely. Pragmatic, but 62 unused exports remain unaddressed. The barrel re-export pattern confused Knip's analysis, and the agent had no way to work around it without manual review of each export. Sometimes the right move is to leave the table.
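The recovery, expressed as configuration, assuming Knip's `rules` option works as documented:

```jsonc
// knip.jsonc — report unused exports without failing the run,
// leaving them for manual review instead of auto-fix.
{
  "rules": {
    "exports": "warn"
  }
}
```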

Beat 3: The final tally.

  • 8 subagent tasks dispatched; 4 failed or needed recovery. A 50% success rate — but without a baseline, the number conveys the texture of the session more than it measures anything. Credit where due: the agent correctly configured the project glob, diagnosed the crash symptoms accurately, and landed a cleanup with a passing build.
  • 1 full git revert.
  • 3 Knip crashes.
  • 1 patched node_modules file. Yes, the agent monkey-patched a dependency to work around a parser bug. No, I'm not proud of us.
  • 3 false positive dependency removals caught in verification.
  • ~30 discrete operations, ~150,000 tokens consumed.
  • Result: 9,707 lines removed, 65 files deleted, 13 dependencies cleaned, build passing.

Knip's contribution: the analysis. 15 seconds, $0, deterministic. The agent's contribution: the orchestration, the debugging, the recovery. Slower, expensive, imperfect, but it landed. The task — cleaning a Next.js codebase with framework-specific transitive dependencies, barrel re-exports, and plugin-driven scope expansion — genuinely was hard. Even a senior developer might have tripped on the eslint transitivity problem. What the agent got wrong was the input scope. Everything else it executed well.

I later learned this division of labor isn't unique to my project. Anthropic published a breakdown of their SWE-bench agent — the same setup: a model that reasons about what to do, and two tools (Bash and a file editor) that execute. The model's job is knowing what to run. The tools' job is running it. When that boundary blurs, you get 15,000 files parsed to answer a question about 100.


What Changed

This isn't about agents being bad or good. It's about how agentic workflows change developer behavior in ways that aren't immediately obvious, like how having a dishwasher changes how you think about rinsing plates.

The assumption shift. In the pre-agentic era, if I installed a static analysis tool and it crashed, my first thought would be "what's in the input set?" I'd check .gitignore, I'd look at folder sizes, I'd exclude build artifacts. That was instinct, not expertise. Every developer who has watched a linter choke on node_modules develops that instinct. It's the same instinct that makes you look both ways before crossing a street, even if the light is green. The light doesn't know about the guy running the red.

With agentic workflows, that instinct atrophies. Not because the knowledge disappears, but because the activation pattern changes. When an agent is driving, I don't think about folder sizes or input sets. I assume the agent will handle it. The agent has access to the filesystem, it can read .gitignore, it can check file sizes. Why wouldn't it? It's like assuming your friend who's driving knows the way, because they have Google Maps. Then you end up in a river.

The answer: because the agent's context is shaped by the documentation it reads and the prompts it receives. The Knip documentation talks about configuration, not about computational hygiene. The agent's prompt says "run Knip and fix the issues," not "run Knip without doing anything stupid." The agent does what it's asked. The gap between what it's asked and what it should do is where the human judgment lives. It's also, increasingly, where the human was sitting before the agent drove into a ditch.

The discovery gap, reframed. The original framing was "LLMs hide tools from developers." That's not quite right. The more accurate framing: agentic workflows change the default action from "search for a tool" to "ask the agent." The cost of searching is the same as it ever was. The cost of asking dropped to zero. So the equilibrium shifted. It's the same reason people use delivery apps for groceries that are literally downstairs. Not because the app is better. Because opening the app is easier than putting on shoes.

Bijoy's Slack message was a reminder that the old equilibrium still exists. The tool was there. I just wasn't reaching for it anymore. I was asking the agent, and the agent was asking Knip, and Knip was asking my RAM for 6 gigabytes, and my RAM was asking for a lawyer.

The context boundary. Most of what went wrong in this session traces back to context. The agent didn't know to exclude .next/ because that knowledge wasn't in the Knip docs it prioritized. It didn't know eslint was transitively needed because framework knowledge isn't in package.json. It didn't know the export auto-fix would break barrel re-exports because it hadn't read the barrel files in the same context as the Knip output. Each failure is a story about two pieces of information that needed to be in the same room but weren't.

This isn't a fundamental limitation of LLMs. It's a context window problem. Anthropic's own Claude Code best practices lead with this: "Most best practices are based on one constraint: Claude's context window fills up fast, and performance degrades as it fills." The docs recommend subagents, clearing context between tasks, and aggressive context management — not because the model is weak, but because context is the bottleneck. If the agent could hold the entire project, every file's content, the Knip documentation, and the operational state of the machine in a single context, it would make different decisions. It can't. So it makes locally optimal decisions that are globally suboptimal. That's not a bug. That's the nature of working with bounded context. It's also the nature of being human, incidentally. We just have better heuristics for papering over the gaps.

The research on retrieval-augmented generation backs this up — fine-grained, targeted context outperforms dumping everything into the window. The agent didn't need more context. It needed the right context at the right time. That's what I was supposed to provide, and didn't.

The irony: Knip has since shipped an MCP Server specifically for coding agents. The tagline: "Tell your coding agent to 'configure knip' and it will RTFM so you don't have to." If I'd run this session six months later, the agent might have used the MCP Server to configure Knip correctly from the start. The tool I was complaining about had already built the escape hatch I was wishing for. I just didn't know it existed.

Developing Taste

So when do you reach for a tool, and when do you ask an agent? I've been thinking about this in terms of a rough heuristic. Not a decision tree — those are for people who think the world has clean branches. More like a set of forces that pull you in one direction or the other.

On one side, "reach for a tool": static analysis, type checking, linting, formatting, build systems. These are problems where the algorithm is known, the input is structured, and the answer is binary (correct/wrong). On the other, "ask the agent": code generation, architecture decisions, debugging hypotheses, test case design. These are problems where the answer is a draft you iterate on (better/worse).

Most real tasks sit somewhere in the middle. The Knip session was a left-column task (find unused code) that I delegated to a right-column worker (an LLM agent), with mixed results. The failure wasn't that the agent "tried to replace Knip with vibes." It was weaker orchestration: poor tool configuration (didn't scope the input set correctly before invoking), weak pre-call reasoning (didn't model the computational cost), and reactive diagnosis (debugged the symptom, not the cause). These are orchestration failures, not paradigm failures.

The taste develops from experience. You reach for a tool, it works, you remember. You ask an agent, it hallucinates, you remember. Over time you build an internal model of which problems are "tool problems" and which are "agent problems." The model isn't perfect. But it's better than treating every problem the same way.


What I Took Away

  • Purpose-built tools have a computational advantage that agents can't replicate. Knip parses 100 files in linear time, finishing in seconds for $0. An agent doing the same analysis by reading files pairwise would need quadratically many reads and orders of magnitude more tokens. Anthropic's own SWE-bench report notes that successful agent runs "took hundreds of turns... and >100k tokens" — and that the model's tenacity "can be expensive." In my session, the agent consumed ~150,000 tokens to orchestrate what Knip did in 15 seconds. The math doesn't care how smart the model is. And the industry knows it: Anthropic's agent design guide explicitly frames every architectural decision as a "latency and cost" tradeoff, recommending that easy tasks be routed to cheaper models and agent complexity be added only when simpler solutions fail. The billable unit of agent efficiency isn't operations — it's tokens. The billable unit of tool efficiency isn't tokens — it's milliseconds. These are different currencies, and the exchange rate is not in the agent's favor.

    The models are getting smarter — scaling laws show performance improves predictably with compute. In early 2025, Claude 3.5 Sonnet solved 49% of real-world GitHub issues on SWE-bench Verified, up from 22% a year prior, and the scores have climbed further since. But smarter models consume more tokens per inference, and the cost-per-task curve hasn't flattened. A faster racehorse is still more expensive than a bicycle, if the bicycle gets you there.

  • The agent's job is orchestration, not simulation. Knip removed 9,707 lines in seconds. The agent spent 150,000 tokens setting up, debugging, and recovering. The right division of labor: the tool does the analysis, the agent does the plumbing. You don't ask your search engine to also write the email.

  • Agentic workflows atrophy operational instincts. I stopped thinking about input set hygiene because I stopped being the one running the tools. That instinct was earned through years of watching tools choke. It's worth preserving, even when someone else is driving. It's like knowing how to drive stick. You don't need it every day. But the day the automatic transmission fails, you're glad you remember.

  • Context boundaries explain most agent failures, not capability boundaries. The agent didn't exclude .next/ because it didn't have the operational context to know it should. Not because it can't understand folder sizes. The distinction matters because it points to the fix: better prompts, better tooling, better context management. Not a bigger model. You don't need a bigger hammer. You need to stop hitting the same nail.
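The token-vs-milliseconds point above, as back-of-envelope arithmetic. The per-token price is an assumption for illustration, not a quote from any provider:

```typescript
// Two currencies: tokens for the agent, wall-clock for the tool.
const agentTokens = 150_000;           // tokens the session consumed
const assumedUsdPerMillionTokens = 5;  // hypothetical blended rate
const agentCostUsd = (agentTokens / 1_000_000) * assumedUsdPerMillionTokens;

const knipSeconds = 15;                // local CPU, measured in wall-clock time
const knipCostUsd = 0;                 // no billable unit at all
```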

The FunSearch insight is what stays with me. Google DeepMind's FunSearch used an LLM to discover better algorithms for the bin-packing problem — but it wasn't a magic moment of "thinking" a better algorithm. It was an evolutionary loop: the LLM generates code, a deterministic evaluator scores it, the best solutions are fed back, the cycle repeats. The LLM provides creativity. The evaluator provides correctness. Neither could have done it alone.

This is the article's thesis in its sharpest form: the model generates, the tool evaluates, and the tool makes the search honest. Without the deterministic check, you're just sampling from a probability distribution and hoping the output is correct. With it, you're doing directed search. The difference between "hoping" and "searching" is the difference between O(2^n) and O(n log n).
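The FunSearch loop in miniature: a stochastic proposer (the LLM's role, stubbed here as a plain function) gated by a deterministic scorer that keeps the search honest. Everything here is illustrative:

```typescript
// Directed search: propose candidates, keep only what the evaluator scores higher.
function directedSearch(
  propose: (best: number) => number,
  score: (x: number) => number,
  rounds: number,
): number {
  let best = propose(0);
  for (let i = 0; i < rounds; i++) {
    const candidate = propose(best);
    if (score(candidate) > score(best)) best = candidate; // evaluator gates every step
  }
  return best;
}
```

Drop the `score` check and you're just sampling and hoping; with it, every accepted step is verified progress.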


Further Reading

  • Scaling Laws for Neural Language Models — Kaplan et al. (2020). The foundational paper showing LLM performance scales as a power law with compute, model size, and data. The reason models keep getting smarter isn't magic — it's math.

  • Raising the Bar on SWE-bench Verified — Anthropic (2025). How Claude 3.5 Sonnet achieves 49% on real-world software engineering tasks using a minimal agent scaffold. The agent reasons; the tools execute.

  • FunSearch: Making New Discoveries in Mathematical Sciences Using LLMs — Fawzi et al., Nature (2023). The first LLM-driven discovery in open mathematical problems. An LLM generates algorithms, a deterministic evaluator scores them, and the cycle discovers solutions that outperform human-designed heuristics. The model provides creativity; the tool provides correctness.

  • Claude Code Best Practices — Anthropic (2025). The company that builds Claude says context management is the #1 constraint. "Claude's context window fills up fast, and performance degrades as it fills." Practical guidance on when to use subagents, when to clear context, and how to structure agentic workflows.

  • A Survey on Large Language Model based Autonomous Agents — Wang et al. (2023). Comprehensive survey of LLM-based agents across domains. Useful for understanding the architectural patterns that make agents work (or fail).

  • Knip Documentation — The tool that started all this. Read the plugin docs, not just the config docs.

  • How To Avoid O-Abuse and Bribes — R.J. Lipton (2009). Bob Sedgewick's argument that Big O notation is widely misused to predict performance, and that ≈ notation with explicit constants is more honest. "Experimental verification of hypotheses is easy with ≈ notation and impossible with O-notation."

  • Galactic Algorithms — R.J. Lipton (2010). Coined the term for algorithms with wonderful asymptotic behavior that are never used in practice. David Johnson: "I would easily prefer |V|^70 to even constant time, if that constant had to be one of Robertson and Seymour's."

  • Building Effective Agents — Anthropic (2024). "Agentic systems often trade latency and cost for better task performance." The industry's own framing of agent efficiency: not in Big O terms, but in dollars and milliseconds.
