Trust Is a Build Artifact: A Testing Philosophy for Agentic Work

Mar 17, 2026

8 min read

Testing is not proof, and it is definitely not virtue. It is the system I use to turn refactors, AI-assisted code, and late-night confidence into something deterministic enough to trust.

#testing

#vitest

#playwright

#typescript

#ai

#architecture

"Testing is the art of being betrayed early, cheaply, and with good logs."

I had one of those afternoons where I opened a test file "just to check something" and then looked up to discover that the day had quietly left without me.

There is a particular brand of optimism that developers tend to invoke when we do not want to write tests:

"It is fine. I will be careful."

It is never fine. I am not careful. I am merely optimistic with excellent syntax.

That optimism was at least affordable in an era when writing code and writing tests were both slow. Both cost the same resource: my time and attention. I could budget them against each other.

That era is over. Agents now write plausible code faster than I can review it. The bottleneck shifted. The expensive thing is no longer keystrokes. It is knowing what to trust. And the only honest answer to that question, in a codebase where patches arrive at model speed, is a test suite that earns its authority rather than merely existing.

For a long time I described testing in the usual pious language: quality, correctness, reliability, confidence. All true, all slightly bloodless. The more honest description is simpler.

Testing is how I buy the right to change code without behaving like I am defusing a bomb.

In the agentic era, that statement stops being motivational-poster fluff and becomes operating procedure. The bottleneck is no longer writing; it is verification, and verification is paid for in human attention.

That is what this piece is really about: not "my testing setup" as a museum tour, but the set of constraints and recurring losses that pushed me into this setup in the first place. Vitest, Playwright, route tests, service tests, integration checks: none of them arrived because I sat down one day and designed an elegant taxonomy. They arrived because I kept paying for certain classes of ignorance and eventually got tired of financing them personally.

# First Principles

The only definition of a test I still believe is this:

A test is a deterministic evaluator over a chosen boundary.

That sentence is carrying more weight than it first appears to.

# Deterministic evaluator

A code review is an evaluator too. So is a product demo. So is me staring at a diff with the spiritual energy of a skeptic hunting for a ghost. But those evaluators are expensive, inconsistent, and ultimately bottlenecked by human verification debt.

A test, by contrast, is gloriously uncharitable.

It does not care whether the patch feels elegant. It does not care whether I was tired. It does not care how confidently the agent narrated its reasoning. It does not care whether I am "pretty sure" the refactor is equivalent.

It just compares behavior against expectation and either lets me proceed or embarrasses me immediately.

That is the real superpower: not intelligence, but repeatability.

# Chosen boundary

This is the part that makes testing less about "quality" as an ideal and more about boundaries as a tool.

Every test is a deal. I choose what to include inside the boundary, what to fake outside it, and what question I want answered. Most arguments about unit tests versus integration tests are really arguments about where someone drew that boundary and whether it was worth the cost.

The boundary is literal.

Move it outward and you buy more realism but pay in setup cost, slower feedback, and harder forensics when something fails. Move it inward and you go faster but lose confidence in the seams. The position of the line is an economic decision, not a moral one.

In the agentic era, that decision does not go away. An agent picking the test layer by default is not making this trade-off; it is skipping it. The boundary still belongs to whoever has a theory about what the software should do.

That is why I still find the Practical Test Pyramid and Kent C. Dodds' Testing Trophy useful. Not because they settle anything permanently, but because they force the economic question:

Where should most of my confidence come from, and what is the cheapest honest evaluator for this behavior?

These two frameworks answer that question differently, and the difference matters.

The Pyramid says: write many fast unit tests, some integration tests, and very few E2E tests. The Trophy makes a different argument. Most bugs live at the seam between components, not inside any single one. Integration tests earn the widest band because that is where the confidence-per-cost argument resolves most favorably. Unit tests are intentionally smaller. Static analysis sits at the base because it catches cheap mistakes before any test runs at all.

I try to avoid the "test everything" trap. Aiming for maximum coverage often produces a ceremonial suite that is too loud to be useful and too brittle to survive a refactor. I’d rather have a smaller, more honest suite that I actually trust when the build goes red.

I want evaluators that are:

cheap enough to run often
specific enough to diagnose failure
broad enough to matter
stable enough to survive refactors
honest enough to reduce human verification work

Everything else is implementation detail.

What "cheap" actually means has shifted.

Before agentic tooling, the dominant cost of a test was the time it took to write it. A unit test that saved ten minutes of future debugging was worth the five minutes it took to author. I could reason about the trade-off with my own calendar.

That arithmetic has changed. Writing is now near-free. An agent can scaffold a hundred tests while I read the pull request description. But the rest of the cost structure is unchanged or higher:

Running cost: wall-clock time in CI, or real API spend if the test invokes an LLM at each step
Maintenance: every test that breaks on a neutral refactor is work I pay for later
False confidence: a test that passes on wrong behavior is not free. It costs the debugging session I did not know I still owed

So the evaluator properties above are not a wish list from a slower era. They are a more precise checklist for the era where writing is cheap and everything after writing is not.

# What I Am Actually Optimizing For

When I say I care about testing, I do not mean that I enjoy the ceremony of assertions. Even now, when an agent can author them in seconds, the verification loop feels like renewing government documents. I do it because the alternative is a codebase I can no longer explain or trust.

These are the things I am actually optimizing for.

# 1. Signal

A failing test should mean something I care about.

If a test fails because I renamed a class, changed a glyph, reordered markup, or otherwise touched an implementation detail that users never perceive, the test is not protecting behavior. It is charging me rent for the privilege of refactoring.

Eventually I start ignoring that kind of failure, and once that happens the suite begins its long career as theater.

# 2. Coupling

I want tests coupled to externally meaningful behavior, not internal choreography.

Google's Testing on the Toilet said it best years ago: test behavior, not implementation. I have since translated that into my own less charitable phrasing:

If a refactor that preserves behavior breaks the suite, the suite is part of the bug.

# 3. Forensics

When something fails, I want evidence.

Not vibes. Not "works on my machine." Not a gut feeling that Playwright is being moody today.

I want:

a trace
a screenshot
the request and response shape
the exact assertion that failed
enough context that I do not need to re-run the same thing five times hoping clarity descends from heaven

Debugging is not a morality play. It is an evidence problem.

# 4. Human verification reduction

This is the one that became impossible to ignore once I started using AI to write more code.

Sonar's 2026 survey is one of the more useful summaries of the current situation: developers are generating more code with AI, but review and verification effort is becoming the choke point, with many reporting that reviewing AI-generated code takes more effort than reviewing human-written code. Their framing of a "verification gap" is worth reading because it names the real workflow problem directly.

That resonates because it matches exactly what I feel locally.

The expensive thing in software is no longer typing. The expensive thing is deciding what to trust.

So my tests are not just quality tools. They are review compressors. They shrink the amount of code I need to hold in my head at once. They let me stop auditing every line like a customs officer searching for contraband intent.

# The Costs That Forced My Stack Into Shape

My stack is not exotic. I have a Next.js TypeScript repo. I use Vitest for most of the fast loop, React Testing Library where component behavior matters, and Playwright where the browser gets final say. But those are the nouns. The more useful story is the sequence of prices I got tired of paying.

# I got tired of being wrong about things the computer already knew

The absolute cheapest place to be wrong is in the editor, before the code even saves.

In this repo, static analysis handles the structural noise. TypeScript and ESLint form the base of the Testing Trophy because they provide high-signal feedback for almost zero cost. They aren't a replacement for tests; they are simply the fastest way to flush out typos and type mismatches before they can hide in the logic.

Hallucinated types and unhandled nulls are not 'business logic' problems. They are structural noise. In an era where patches arrive at model speed, static analysis is the first line of defense. If the system rejects a bad proposal in milliseconds, I have saved myself the debugging session I would have otherwise wasted on a hallucination.

I treat static analysis as the floor. If the types don't pass, the behavior is irrelevant.
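A minimal sketch of the kind of mistake the type checker rejects before any test runs, assuming `strict` mode (and therefore `strictNullChecks`) is enabled in the TypeScript config. The `Post` shape and the `titleFor` helper are hypothetical and only there for illustration.

```typescript
// Hypothetical shapes for illustration; not taken from the repo.
type Post = { slug: string; title: string };

const posts = new Map<string, Post>([
  ["hello", { slug: "hello", title: "Hello, world" }],
]);

// Map.get returns `Post | undefined`. With strictNullChecks enabled,
// writing `posts.get(slug).title` is a compile error, so the missing-post
// case has to be handled before the code can even build.
function titleFor(slug: string): string {
  const post = posts.get(slug);
  if (post === undefined) {
    return "Not found";
  }
  return post.title; // narrowed to `Post` on this line
}
```

Nothing here needed a test runner. The unhandled case is rejected at the floor, which is the point.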

# I got tired of re-debugging math I had already "basically" understood

The cheapest humiliation in software is pure logic drifting while I am busy feeling clever about a refactor.

Pagination math, normalization, parsing, filtering, little ranking rules, data reshaping - all of it has the same personality. It looks too small to deserve ceremony until it quietly changes the behavior of an entire page. That class of mistake is why Vitest ends up doing so much work for me. Not because "unit tests are good," but because this is the cheapest place to stop lying to myself.

If the behavior can be represented as input in, output out, with no browser, no router, no network, no ceremony, I want it under npm test and I want it there immediately. That is the bargain: I pay a small amount of upfront precision so I do not have to re-derive the invariant later from a bug report.

Vitest is useful here for very practical reasons:

it is fast enough that I do not resent it
it sits close enough to the app that I do not feel like I am switching worlds
it lets me use coverage as a flashlight when I need one, not as a personality trait (see Vitest's coverage guide)

If a pure-logic test fails, I treat that as high-signal until proven otherwise. If I cannot explain the invariant in one sentence, I assume I do not yet understand the code well enough to change it safely.
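As a sketch of what "input in, output out" looks like, here is a hypothetical pagination helper with its invariant stated in one sentence. The `paginate` name and the `PageWindow` shape are assumptions, not code from the repo, and the checks are plain values so the snippet runs on its own; in practice each would be an `expect` inside a Vitest `test`.

```typescript
// Hypothetical pure helper; the name and shape are assumptions for illustration.
type PageWindow = { page: number; totalPages: number; start: number; end: number };

// Invariant in one sentence: every item lands on exactly one page,
// and the requested page is clamped into [1, totalPages].
function paginate(totalItems: number, pageSize: number, requestedPage: number): PageWindow {
  if (!Number.isInteger(pageSize) || pageSize <= 0) {
    throw new RangeError("pageSize must be a positive integer");
  }
  const totalPages = Math.max(1, Math.ceil(totalItems / pageSize));
  const page = Math.min(Math.max(1, Math.trunc(requestedPage)), totalPages);
  const start = (page - 1) * pageSize;                // inclusive index
  const end = Math.min(start + pageSize, totalItems); // exclusive index
  return { page, totalPages, start, end };
}

// Expected values are written by hand from the invariant, not computed
// by re-running the formula under test.
const lastPage = paginate(45, 10, 5);  // 45 items, 10 per page: pages 1..5
const clamped = paginate(45, 10, 99);  // out-of-range request clamps to page 5
const empty = paginate(0, 10, 1);      // no items still yields one empty page
```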

# I got tired of tests punishing me for changing presentation instead of behavior

This one was entirely self-inflicted.

I used to test UI too literally. I would assert on exact markup shapes, exact text fragments, exact component structure, sometimes with the confidence of someone who had not yet met future-me and therefore had no reason to fear him.

That always ages badly.

The bill comes due the first time a component evolves in a user-neutral way and the suite behaves as if I have vandalized the constitution. That is why React Testing Library became part of my default loop. It nudges me, repeatedly and somewhat condescendingly, toward the kind of questions that survive refactors:

can the user type here?
can they click the control that matters?
does the route change after the action?
does the visible state update in the way a person would actually perceive?

That is what Testing Library's guiding principles are really doing for me. They are not teaching me kindness. They are teaching me coupling discipline.

I still run these under Vitest because I want component behavior to stay in the cheap loop as long as possible. I want to catch "the search interaction broke" before I involve a real browser, a real server process, and a real hour of my life.

# I got tired of discovering architectural mistakes only after the side effects had already happened

Most meaningful bugs in application code are not pure-math bugs and not full-browser bugs. They live in the middle, where a workflow does three or four boring things in a particular order and one of those things quietly stops happening.

That is what pushed me toward service-level tests.

This layer exists because I kept paying for a specific kind of uncertainty:

did validation happen before persistence?
did the side effect fire after the write or before it?
did invalid input short-circuit cleanly?
did an error produce half-finished state?

None of this is glamorous. All of it is expensive to debug late.

So when I write tests against a service with a fake repository or fake mutation effect, I am not performing some abstract "unit testing best practice." I am buying a controlled environment for the orchestration I actually care about. The mock is not the point. The sequencing is the point. Google's "Don't overuse mocks" advice still holds: I am trying to mock outside the workflow boundary, not inside my own thinking.
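A minimal sketch of that idea, assuming a made-up `createNote` workflow. `Note`, `NoteRepository`, and `runWithFakes` are all hypothetical names. The fakes only record the order of calls, because the order is the thing being judged.

```typescript
// All names here (Note, NoteRepository, createNote) are hypothetical.
type Note = { title: string };

interface NoteRepository {
  save(note: Note): void;
}

type Result = { ok: true } | { ok: false; error: string };

// The workflow under test: validate, then persist, then fire the side effect.
function createNote(
  input: Note,
  repo: NoteRepository,
  notify: (note: Note) => void,
): Result {
  if (input.title.trim() === "") {
    return { ok: false, error: "title is required" }; // short-circuit before any write
  }
  repo.save(input);
  notify(input);
  return { ok: true };
}

// Fakes sit outside the workflow boundary and only record what happened.
function runWithFakes(input: Note): { result: Result; calls: string[] } {
  const calls: string[] = [];
  const repo: NoteRepository = { save: () => { calls.push("save"); } };
  const notify = () => { calls.push("notify"); };
  return { result: createNote(input, repo, notify), calls };
}

const valid = runWithFakes({ title: "Ship it" });
const invalid = runWithFakes({ title: "   " });
```

For the valid input the recorded calls are `save` then `notify`; for the blank title nothing is recorded at all. Those two facts answer most of the questions in the list above without a database in sight.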

This is also one of the first places agentic work becomes tolerable. If a model takes a pass at application logic, this seam gives me a cheap judge for whether the workflow still behaves like the system I intended, rather than the one the model found statistically plausible.

# I got tired of silently changing my own API and only realizing it later

Route and contract tests came out of embarrassment more than theory.

Small projects are especially good at lying to their owners here. Because I am often both producer and consumer, it is easy to think the API is not "real" enough to deserve explicit verification. Then I change a status code, an error shape, or a response payload during a refactor, and future-me gets to enjoy a tiny private integration failure with no witnesses.

That is why route tests have become some of the highest-return tests in the repo. They force me to pin down the wire contract while the change is still fresh in my head:

what status does this route return now?
what does the error body look like?
what did I just decide the client is allowed to depend on?

Pact's explanation of contract testing is useful not because I need every piece of consumer-driven contract machinery in a personal repo, but because it captures the core thing I kept forgetting: the wire shape is part of the product. If I change it intentionally, the test makes me say so. If I change it accidentally, the test gets there before a broken page does.
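To make "the wire shape is part of the product" concrete, here is a sketch with a hypothetical `getItems` handler that returns plain objects rather than real HTTP responses. The payload fields and the `invalid_page` error code are assumptions for illustration.

```typescript
// Hypothetical handler and payload shapes; the real routes are not shown in this post.
type RouteResult =
  | { status: 200; body: { items: string[]; page: number } }
  | { status: 400; body: { error: { code: string; message: string } } };

function getItems(query: { page?: string }): RouteResult {
  const page = Number(query.page ?? "1");
  if (!Number.isInteger(page) || page < 1) {
    return {
      status: 400,
      body: { error: { code: "invalid_page", message: "page must be a positive integer" } },
    };
  }
  return { status: 200, body: { items: [], page } };
}

// The contract a client may depend on: status code plus the set of top-level body keys.
function wireShape(result: RouteResult): string {
  return `${result.status}:${Object.keys(result.body).sort().join(",")}`;
}

const okShape = wireShape(getItems({ page: "2" }));      // "200:items,page"
const errorShape = wireShape(getItems({ page: "abc" })); // "400:error"
```

If a refactor renames `items` or quietly turns the 400 into a 500, the pinned string changes and I have to say so out loud.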

# I got tired of pretending reality was cheap

Real infrastructure gives me a kind of honesty that fakes never can, and it charges accordingly.

Integration tests against a real backend buy important signal:

the schema actually matches what I think it matches
the persistence path behaves the way the application assumes
the whole thing survives contact with a real environment

They also buy:

secrets
setup cost
intermittent weirdness
the low-level anxiety that I have pointed at the wrong environment and am about to become a cautionary tale

So I keep those tests env-gated and explicit. That is not an omission in my philosophy. It is my philosophy. Reality is expensive. I want to buy it deliberately, not attach it to every local save out of moral vanity.
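One way to sketch "env-gated and explicit" is a small gate function that decides whether integration tests may run at all. The variable names `RUN_INTEGRATION` and `INTEGRATION_BACKEND_URL` are hypothetical, and the production check is a deliberately crude assumption.

```typescript
// Variable names (RUN_INTEGRATION, INTEGRATION_BACKEND_URL) are hypothetical.
type Env = Record<string, string | undefined>;

type Gate = { run: true; url: string } | { run: false; reason: string };

// Integration tests run only when explicitly requested AND pointed at a
// target that does not look like production.
function integrationGate(env: Env): Gate {
  if (env["RUN_INTEGRATION"] !== "1") {
    return { run: false, reason: "RUN_INTEGRATION is not set to 1" };
  }
  const url = env["INTEGRATION_BACKEND_URL"];
  if (!url) {
    return { run: false, reason: "INTEGRATION_BACKEND_URL is missing" };
  }
  if (/prod/i.test(url)) {
    return { run: false, reason: "refusing to run against a production-looking URL" };
  }
  return { run: true, url };
}

const localSave = integrationGate({});
const wrongTarget = integrationGate({
  RUN_INTEGRATION: "1",
  INTEGRATION_BACKEND_URL: "https://prod.example.com",
});
const deliberate = integrationGate({
  RUN_INTEGRATION: "1",
  INTEGRATION_BACKEND_URL: "http://localhost:54321",
});
```

In a Vitest suite the result of a gate like this, called with `process.env`, could feed something like `describe.skipIf(!gate.run)`, so the skip is visible in the report rather than silent.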

# I got tired of local reasoning being correct in pieces and wrong in sequence

This is where Playwright enters, and it enters late on purpose.

There is a particular class of bug that survives every elegant lower-level test because nothing is individually broken enough. The URL updates but the visible state does not. The component renders correctly but the actual click path is wrong. The browser timing reveals a race that all my clean little isolated tests politely declined to mention.

That is what I am paying Playwright for.

Not browser automation as a concept. Not the warm feeling of having E2E. I am paying for one thing only: the right to stop theorizing and watch the whole thing behave under a real browser runtime.

And because that price is high, I make the tool earn it:

locators, because if my selectors are brittle I am manufacturing future chores
trace viewer, because if the browser fails I want a timeline, not a séance
test retries, because flakes do not become less real when I act offended by them
test reporters, because expensive tests should at least fail with evidence
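As a configuration sketch, this is roughly how those demands map onto a `playwright.config.ts`. The specific values (two retries in CI, HTML plus list reporters) are assumptions for illustration, not the repo's actual settings.

```typescript
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Retry only in CI, so a local flake is seen immediately rather than papered over.
  retries: process.env.CI ? 2 : 0,
  // A terminal summary plus an HTML report to open after a failure.
  reporter: [["list"], ["html", { open: "never" }]],
  use: {
    // Record a trace when a test is retried, so the failure comes with a timeline.
    trace: "on-first-retry",
    // Keep a screenshot for any test that fails.
    screenshot: "only-on-failure",
  },
});
```

Locators are not a config option; they live in the tests themselves, where a role-based query such as `page.getByRole("button", { name: "Next" })` tends to survive markup changes better than a CSS path.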

I keep E2E small for the same reason I keep lawyers expensive: I only want to invoke them when the matter genuinely deserves the full machinery.

# The Mistake That Taught Me More Than the Passing Tests

The most educational class of failures in my repo has not been "the code was broken." It has been "the test was coupled to the wrong thing."

Pagination controls taught me this in a deeply unserious but unforgettable way.

I had tests that effectively assumed a very specific expression of pagination UI. The user-visible idea was sound: navigate between pages, preserve search state, keep behavior consistent. But some of the tests leaned too hard on a particular textual or structural expression of that behavior. Then the component evolved, and suddenly the suite was acting like I had violated a sacred treaty when in reality I had mostly changed the presentation.

That is the sort of failure that looks annoying on the surface but is pedagogically generous.

It teaches three things at once:

The suite is telling me where it is overcoupled.
A refactor always renegotiates contracts, whether I acknowledge it or not.
Tool choice matters only insofar as it helps me express the right boundary.

That is why "test behavior, not implementation" stopped sounding like advice and started sounding like rent control.

# Agentic Engineering and the Real Bottleneck

The biggest conceptual shift for me has been this:

The limiting reagent in software work is no longer code generation. It is trustworthy evaluation.

That is why benchmark design around coding agents keeps circling back to evaluation harnesses and human validation.

SWE-bench's own evaluation guide and harness reference are useful not just as benchmark docs but as philosophy. The basic loop is brutally instructive:

apply the generated patch
run the repository's tests
judge success from the resulting behavior

That is the exact shape I want in miniature inside my own repo.
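A toy version of that loop, under heavy assumptions: a "patch" is modeled as a candidate function and the "repository's tests" as a list of checks. Real harnesses apply diffs and run a test command; the `judge`, `Check`, and `Slugify` names are hypothetical.

```typescript
// A "patch" is modeled as a candidate function; a real harness would apply
// a diff and run the repository's test command instead.
type Check<T> = { name: string; passes: (candidate: T) => boolean };

type Verdict = { accepted: boolean; failures: string[] };

function judge<T>(candidate: T, checks: Check<T>[]): Verdict {
  const failures = checks
    .filter((check) => {
      try {
        return !check.passes(candidate);
      } catch {
        return true; // a throwing check counts as a failure
      }
    })
    .map((check) => check.name);
  return { accepted: failures.length === 0, failures };
}

// Hypothetical behavior under evaluation: turn a title into a URL slug.
type Slugify = (title: string) => string;

const checks: Check<Slugify>[] = [
  { name: "lowercases", passes: (f) => f("Hello") === "hello" },
  { name: "replaces spaces", passes: (f) => f("a b") === "a-b" },
];

const plausiblePatch: Slugify = (t) => t.toLowerCase(); // looks fine, misses spaces
const correctPatch: Slugify = (t) => t.toLowerCase().replace(/\s+/g, "-");

const rejected = judge(plausiblePatch, checks);
const accepted = judge(correctPatch, checks);
```

The judge never reads the patch. It only observes behavior, which is exactly why it is cheap to run and hard to charm.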

OpenAI's original SWE-bench Verified write-up makes the deeper point explicit: even the benchmark itself needed human validation because issue descriptions, tests, and task framing can all be subtly wrong. In other words, evaluation is difficult enough that we sometimes need to verify the verifier.

Anthropic's Building effective agents lands on the same principle from another angle: agents perform better when they can interact with tools that provide concrete ground truth rather than relying on narration and vibes. In a codebase, tests are often the cleanest source of ground truth available.

This has changed how I think about my own review process.

I do not want to read every AI-assisted patch like a suspicious schoolmaster grading handwriting. I want the tooling and the suite to do the first round of disbelief for me. The more faithfully my tests reflect the boundaries I actually care about, the more human review can move up a level: less "is this line secretly wrong?" and more "is this the behavior I want in the product?"

The ideal division of labor looks like this:

the model proposes
the tests judge
I arbitrate intent, tradeoffs, and product meaning

This approach keeps the human in the loop without forcing the human to be the loop.

# Does cheap generation change the calculus?

An agent can draft a test suite for an entire module in the time it takes to make coffee. If writing is effectively free, the obvious question is: why not generate tests at every layer and let coverage saturate? On that reading, the economic question dissolves, and the cheapest honest evaluator is simply all of them.

But we know this does not work, and the reason is structural.

When the same model writes both code and test in the same context window, it encodes the same logical error into both. The assertion's expected value is derived from the implementation, not from the intent. The test passes not because the behavior is correct but because both artifacts share the same origin assumption. MSR 2026 research found that coding agents are more likely to add mocks validating expected interfaces than real behavior. A 2026 survey on AI test generation found it achieves 92% accuracy on controlled benchmarks but only 41% in production settings. The gap between benchmark and reality is the gap between testing what you wrote and testing what you meant.
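A small illustration of that failure mode, using a hypothetical `totalPages` helper with a deliberate bug. The "mirrored" check derives its expectation from the same formula as the implementation; the "intent" check derives it from the requirement.

```typescript
// Hypothetical helper with a deliberate bug: floor drops the final partial page.
function totalPages(totalItems: number, pageSize: number): number {
  return Math.floor(totalItems / pageSize);
}

// Mirrored check: the expected value is re-derived with the same formula,
// so it agrees with the bug by construction.
function mirroredCheck(): boolean {
  const expected = Math.floor(45 / 10);
  return totalPages(45, 10) === expected;
}

// Intent-derived check: 45 items at 10 per page need 5 pages, worked out
// by hand from the requirement rather than from the code.
function intentCheck(): boolean {
  return totalPages(45, 10) === 5;
}

const mirroredPasses = mirroredCheck(); // true, even though the behavior is wrong
const intentPasses = intentCheck();     // false, which is the useful answer
```

Both checks would count as coverage of the same line. Only one of them knows what the page was supposed to do.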

This does not mean agents are useless for tests. It means the tool should match the layer. IDE assistants and coding agents are genuinely useful for unit test boilerplate where the logic is narrow and a human can verify the assertion quickly. Integration test design benefits from reasoning-heavy models with full codebase context, because the question involves how components wire together. E2E flows work better with dedicated platforms that navigate real browser state than with a general model prompted in a chat window. The mistake is applying the cheapest-to-invoke tool to the highest-complexity question.

And even setting aside quality, running costs do not disappear when writing is free. A traditional Playwright test costs CI compute: a few seconds and a marginal cost per run. An agentic E2E test that calls a frontier model at each step costs $0.15 to $0.35 per execution in API fees, before CI time, before retries, before the second model call needed to judge whether the first one passed. Agentic workflows use five to twenty times more tokens than traditional scripts because they plan, self-correct, and maintain context across steps. A team running fifty such tests on every push is looking at thousands of dollars per month in inference spend for testing alone. That is not an enterprise figure. That is a normal product team doing normal CI.
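The arithmetic behind that monthly figure can be sketched in a few lines. The per-execution fee range and the fifty tests per push come from the paragraph above; the push volume (10 pushes per working day, 22 working days) is purely an assumption for illustration, and CI time, retries, and judge calls are left out.

```typescript
// Per-execution fee range and test count come from the text; push volume is assumed.
const TESTS_PER_PUSH = 50;
const FEE_CENTS_LOW = 15;  // $0.15 per execution
const FEE_CENTS_HIGH = 35; // $0.35 per execution

// Assumed for illustration only: 10 pushes per working day, 22 working days.
const PUSHES_PER_DAY = 10;
const WORKING_DAYS = 22;

// Work in integer cents to avoid floating-point drift, then convert to dollars.
function monthlyCostDollars(feeCents: number): number {
  return (TESTS_PER_PUSH * feeCents * PUSHES_PER_DAY * WORKING_DAYS) / 100;
}

const lowEstimate = monthlyCostDollars(FEE_CENTS_LOW);   // 1650
const highEstimate = monthlyCostDollars(FEE_CENTS_HIGH); // 3850
```

Under those assumptions the range is $1,650 to $3,850 per month in API fees alone. Halve the push volume and it is still a line item someone will ask about.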

What the agentic era genuinely changes is the cost of the first draft. Writing is near-free. Running, maintaining, and trusting are not. The Pyramid and Trophy were never about how hard it is to write tests. They were about where to buy confidence per unit of ongoing cost. That question is sharper now, not obsolete.

# My Practical Operating Model

The philosophy only matters if it cashes out into an actual working loop.

This is the loop I keep returning to.

When I touch pure logic, I run npm test first because that is the cheapest place for reality to interrupt me.

When I touch component behavior, I still start there, because I want interaction and semantics to fail in the fast loop before I buy browser runtime.

When I change route shapes or client-visible payloads, I want contract and route tests to go red early so the review surface becomes "what did I just decide the interface means?" instead of "please inspect this entire handler and infer my intent from the rubble."

When I touch a high-value flow, I run npm run test:e2e because that is where composition stops being theoretical.

Before I push, the minimum is still npm run lint and npm test. If the change is user-visible, structurally risky, or exactly the sort of thing I know future-me will otherwise need to manually re-verify, I pay for npm run test:all.

Two tools do most of the work in that loop. Vitest runs in Node.js against a simulated DOM with no browser, no network, and no ceremony. It is fast enough that I run it without thinking about the cost. Playwright runs real Chromium. It cannot simulate; it has to actually load the application, click controls, and wait for the network. That is its value and its price. I use Vitest when I want speed and Playwright when I want truth.

That is not discipline for its own sake. It is me spending verification money where it buys the largest reduction in future human doubt.

# The Short Version

If I strip away all the tool names, categories, and tasteful theory links, my testing philosophy is embarrassingly simple:

cheap evaluators should carry most of the load
expensive evaluators should buy reality, not ego
tests should reduce the amount of code I must personally distrust
the agentic era makes verification, not generation, the scarce resource

The agentic era did not make the economic question cheaper. It made getting the answer wrong more expensive.

That is why I care about testing.

Not because it makes me virtuous. Not because it makes the repo look grown up. Not because a dashboard somewhere can render a coverage donut in brand colors.

I care because testing is how I turn code, especially AI-assisted code, into something I can change without negotiating with fear every single time.

# Further Reading

Practical Test Pyramid
Testing Trophy and testing classifications
Testing Library: Guiding principles
Google Testing on the Toilet: Test behavior, not implementation
Google Testing on the Toilet: Don't overuse mocks
Vitest: Coverage guide
Playwright: Locators, Trace viewer, Retries, Reporters
Sonar: Verification gap in AI coding
SWE-bench: Evaluation guide and Harness reference
OpenAI: Introducing SWE-bench Verified
Anthropic: Building effective agents
MSR 2026: Are Coding Agents Generating Over-Mocked Tests? and Testing with AI Agents

xxx

Mar 17, 2026

8 min read

Trust Is a Build Artifact: A Testing Philosophy for Agentic Work

Testing is not proof, and it is definitely not virtue. It is the system I use to turn refactors, AI-assisted code, and late-night confidence into something deterministic enough to trust.

#testing

#vitest

#playwright

#typescript

#ai

#architecture

"Testing is the art of being betrayed early, cheaply, and with good logs."

I had one of those afternoons where I opened a test file "just to check something" and then looked up to discover that the day had quietly left without me.

There is a particular brand of optimism that developers tend to invoke when we do not want to write tests:

"It is fine. I will be careful."

It is never fine. I am not careful. I am merely optimistic with excellent syntax.

That was fine advice for an era when writing code and writing tests were both slow. Both cost the same resource: my time and attention. I could budget them against each other.

For a long time I described testing in the usual pious language: quality, correctness, reliability, confidence. All true, all slightly bloodless. The more honest description is simpler.

Testing is how I buy the right to change code without behaving like I am defusing a bomb.

# First Principles

The only definition of a test I still believe is this:

A test is a deterministic evaluator over a chosen boundary.

That sentence is carrying more weight than it first appears to.

# Deterministic evaluator

A test, by contrast, is gloriously uncharitable.

It just compares behavior against expectation and either lets me proceed or embarrasses me immediately.

That is the real superpower: not intelligence, but repeatability.

# Chosen boundary

This is the part that makes testing less about "quality" as an ideal and more about boundaries as a tool.

The boundary is literal.

That is why I still find the Practical Test Pyramid and Kent C. Dodds' Testing Trophy useful. Not because they settle anything permanently, but because they force the economic question:

Where should most of my confidence come from, and what is the cheapest honest evaluator for this behavior?

These two frameworks answer that question differently, and the difference matters.

I want evaluators that are:

cheap enough to run often
specific enough to diagnose failure
broad enough to matter
stable enough to survive refactors
honest enough to reduce human verification work

Everything else is implementation detail.

What "cheap" actually means has shifted.

That arithmetic has changed. Writing is now near-free. An agent can scaffold a hundred tests while I read the pull request description. But the rest of the cost structure is unchanged or higher:

Running cost: wall-clock time in CI, or real API spend if the test invokes an LLM at each step
Maintenance: every test that breaks on a neutral refactor is work I pay for later
False confidence: a test that passes on wrong behavior is not free. It costs the debugging session I did not know I still owed

So the evaluator properties above are not a wish list from a slower era. They are a more precise checklist for the era where writing is cheap and everything after writing is not.

# What I Am Actually Optimizing For

These are the things I am actually optimizing for.

# 1. Signal

A failing test should mean something I care about.

Eventually I start ignoring that kind of failure, and once that happens the suite begins its long career as theater.

# 2. Coupling

I want tests coupled to externally meaningful behavior, not internal choreography.

Google's Testing on the Toilet said it best years ago: test behavior, not implementation. I have since translated that into my own less charitable phrasing:

If a refactor that preserves behavior breaks the suite, the suite is part of the bug.

# 3. Forensics

When something fails, I want evidence.

Not vibes. Not "works on my machine." Not a gut feeling that Playwright is being moody today.

I want:

a trace
a screenshot
the request and response shape
the exact assertion that failed
enough context that I do not need to re-run the same thing five times hoping clarity descends from heaven

Debugging is not a morality play. It is an evidence problem.

# 4. Human verification reduction

This is the one that became impossible to ignore once I started using AI to write more code.

That resonates because it matches exactly what I feel locally.

The expensive thing in software is no longer typing. The expensive thing is deciding what to trust.

# The Costs That Forced My Stack Into Shape

# I got tired of being wrong about things the computer already knew

The absolute cheapest place to be wrong is in the editor, before the code even saves.

I treat static analysis as the floor. If the types don't pass, the behavior is irrelevant.

# I got tired of re-debugging math I had already "basically" understood

The cheapest humiliation in software is pure logic drifting while I am busy feeling clever about a refactor.

Vitest is useful here for very practical reasons:

it is fast enough that I do not resent it
it sits close enough to the app that I do not feel like I am switching worlds
it lets me use coverage as a flashlight when I need one, not as a personality trait (coverage guide)

# I got tired of tests punishing me for changing presentation instead of behavior

This one was entirely self-inflicted.

That always ages badly.

can the user type here?
can they click the control that matters?
does the route change after the action?
does the visible state update in the way a person would actually perceive?

That is what Testing Library's guiding principles are really doing for me. They are not teaching me kindness. They are teaching me coupling discipline.

# I got tired of discovering architectural mistakes only after the side effects had already happened

That is what pushed me toward service-level tests.

This layer exists because I kept paying for a specific kind of uncertainty:

did validation happen before persistence?
did the side effect fire after the write or before it?
did invalid input short-circuit cleanly?
did an error produce half-finished state?

None of this is glamorous. All of it is expensive to debug late.
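A sketch of how those ordering questions can be asked directly, assuming a hypothetical registerUser service with an injected repository and mailer. Nothing here comes from a real library, and the service is kept synchronous only to keep the example short. The fakes share a single call log so the check can assert on sequence, not just on outcome.

```typescript
// Hypothetical service and collaborators, kept synchronous for brevity.
// A real service would almost certainly be async; the ordering idea is the same.
type NewUser = { email: string };
type RegisterResult = { ok: true; id: number } | { ok: false; error: string };

interface UserRepo {
  save(user: NewUser): number;
}
interface Mailer {
  sendWelcome(email: string): void;
}

function registerUser(input: NewUser, repo: UserRepo, mailer: Mailer): RegisterResult {
  // Validate first so invalid input never reaches persistence.
  if (!input.email.includes("@")) {
    return { ok: false, error: "invalid_email" };
  }
  const id = repo.save(input);
  // The side effect fires only after the write has succeeded.
  mailer.sendWelcome(input.email);
  return { ok: true, id };
}

// Recording fakes share one log so a test can assert on call order.
function makeFakes(options: { saveFails?: boolean } = {}) {
  const calls: string[] = [];
  const repo: UserRepo = {
    save() {
      calls.push("save");
      if (options.saveFails) throw new Error("db down");
      return 1;
    },
  };
  const mailer: Mailer = {
    sendWelcome() {
      calls.push("sendWelcome");
    },
  };
  return { calls, repo, mailer };
}

// Happy path: write, then side effect.
const happy = makeFakes();
registerUser({ email: "a@example.com" }, happy.repo, happy.mailer);

// Invalid input: nothing downstream is touched.
const invalid = makeFakes();
const rejected = registerUser({ email: "nope" }, invalid.repo, invalid.mailer);

// Failed write: the welcome mail must not go out.
const broken = makeFakes({ saveFails: true });
try {
  registerUser({ email: "a@example.com" }, broken.repo, broken.mailer);
} catch {
  // Expected: the error propagates to the caller.
}
```

The three call logs answer three of the four questions above without a database or a mail server. The half-finished-state question needs a fake that actually holds state, which I have left out to keep the sketch small.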

# I got tired of silently changing my own API and only realizing it later

Route and contract tests came out of embarrassment more than theory.

That is why route tests have become some of the highest-return tests in the repo. They force me to pin down the wire contract while the change is still fresh in my head:

what status does this route return now?
what does the error body look like?
what did I just decide the client is allowed to depend on?
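Those three questions can be pinned in a few lines. This is a framework-free sketch with a hypothetical getWidgetRoute handler; the route, the Widget type, and the error body shape are all assumptions made up for the example, not a description of any real API.

```typescript
// Hypothetical, framework-free route handler; names and error shape are illustrative.
type Widget = { id: string; name: string };
type HttpResult = { status: number; body: unknown };

function getWidgetRoute(id: string, store: Map<string, Widget>): HttpResult {
  const widget = store.get(id);
  if (!widget) {
    return { status: 404, body: { error: { code: "widget_not_found" } } };
  }
  return { status: 200, body: { id: widget.id, name: widget.name } };
}

// Contract check: status plus exact JSON shape.
// JSON.stringify is key-order sensitive; a real suite would use a deep-equality matcher.
function matchesContract(actual: HttpResult, expected: HttpResult): boolean {
  return (
    actual.status === expected.status &&
    JSON.stringify(actual.body) === JSON.stringify(expected.body)
  );
}

const widgets = new Map<string, Widget>([["w1", { id: "w1", name: "Gear" }]]);

// What status does this route return now, and what may the client depend on?
const found = matchesContract(getWidgetRoute("w1", widgets), {
  status: 200,
  body: { id: "w1", name: "Gear" },
});

// What does the error body look like?
const missing = matchesContract(getWidgetRoute("nope", widgets), {
  status: 404,
  body: { error: { code: "widget_not_found" } },
});
```

The value is in the exactness. If I later rename a field or change the 404 into a 200 with an empty body, the check fails while I still remember why I made the change.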

# I got tired of pretending reality was cheap

Real infrastructure gives me a kind of honesty that fakes never can, and it charges accordingly.

Integration tests against a real backend buy important signal:

the schema actually matches what I think it matches
the persistence path behaves the way the application assumes
the whole thing survives contact with a real environment

They also buy:

secrets
setup cost
intermittent weirdness
the low-level anxiety that I have pointed at the wrong environment and am about to become a cautionary tale

# I got tired of local reasoning being correct in pieces and wrong in sequence

This is where Playwright enters, and it enters late on purpose.

That is what I am paying Playwright for.

Not browser automation as a concept. Not the warm feeling of having E2E. I am paying for one thing only: the right to stop theorizing and watch the whole thing behave under a real browser runtime.

And because that price is high, I make the tool earn it:

locators, because if my selectors are brittle I am manufacturing future chores
trace viewer, because if the browser fails I want a timeline, not a séance
test retries, because flakes do not become less real when I act offended by them
test reporters, because expensive tests should at least fail with evidence
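As a configuration sketch, three of those four items can be switched on in playwright.config.ts. The specific values below (two retries, trace on first retry, screenshots on failure, list plus HTML reporters) are assumptions chosen for illustration, not a recommended baseline. Locators live in the tests themselves, so they do not appear here.

```typescript
// playwright.config.ts: an illustrative sketch, not a recommended baseline.
import { defineConfig } from "@playwright/test";

export default defineConfig({
  // Retry failed tests; a test that fails and then passes on retry is reported as flaky.
  retries: 2,
  // Terminal output plus an HTML report that is not opened automatically.
  reporter: [["list"], ["html", { open: "never" }]],
  use: {
    // Record a trace only when a test is retried for the first time.
    trace: "on-first-retry",
    // Keep a screenshot for tests that fail.
    screenshot: "only-on-failure",
  },
});
```

Recording traces only on retry keeps the passing runs cheap while still leaving a timeline behind for the runs that misbehave.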

I keep E2E small for the same reason I keep lawyers expensive: I only want to invoke them when the matter genuinely deserves the full machinery.

# The Mistake That Taught Me More Than the Passing Tests

The most educational class of failures in my repo has not been "the code was broken." It has been "the test was coupled to the wrong thing."

Pagination controls taught me this in a deeply unserious but unforgettable way.

That is the sort of failure that looks annoying on the surface but is pedagogically generous.

It teaches three things at once:

The suite is telling me where it is overcoupled.
A refactor always renegotiates contracts, whether I acknowledge it or not.
Tool choice matters only insofar as it helps me express the right boundary.

That is why "test behavior, not implementation" stopped sounding like advice and started sounding like rent control.

# Agentic Engineering and the Real Bottleneck

The biggest conceptual shift for me has been this:

The limiting reagent in software work is no longer code generation. It is trustworthy evaluation.

That is why benchmark design around coding agents keeps circling back to evaluation harnesses and human validation.

SWE-bench's own evaluation guide and harness reference are useful not just as benchmark docs but as philosophy. The basic loop is brutally instructive:

apply the generated patch
run the repository's tests
judge success from the resulting behavior

That is the exact shape I want in miniature inside my own repo.
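In miniature, that loop might look like the sketch below. It is emphatically not the SWE-bench harness API: Workspace, TestReport, Verdict, and evaluatePatch are hypothetical names, and the "no tests ran" rule is my own assumption about what should count as evidence.

```typescript
// Hypothetical types: this mirrors the shape of the loop, not any real harness API.
type TestReport = { passed: number; failed: number };
type Verdict = { accepted: boolean; reason: string };

interface Workspace {
  // Returns false if the patch does not apply cleanly.
  applyPatch(patch: string): boolean;
  runTests(): TestReport;
}

function evaluatePatch(patch: string, workspace: Workspace): Verdict {
  // Step 1: apply the generated patch.
  if (!workspace.applyPatch(patch)) {
    return { accepted: false, reason: "patch did not apply" };
  }
  // Step 2: run the repository's tests.
  const report = workspace.runTests();
  // Step 3: judge success from the resulting behavior.
  if (report.failed > 0) {
    return { accepted: false, reason: `${report.failed} test(s) failed` };
  }
  if (report.passed === 0) {
    // Assumption: an empty run is treated as no evidence, not as success.
    return { accepted: false, reason: "no tests ran" };
  }
  return { accepted: true, reason: `${report.passed} test(s) passed` };
}

// Fake workspaces standing in for a real checkout and test runner.
const greenWorkspace: Workspace = {
  applyPatch: () => true,
  runTests: () => ({ passed: 3, failed: 0 }),
};
const redWorkspace: Workspace = {
  applyPatch: () => true,
  runTests: () => ({ passed: 2, failed: 1 }),
};

const greenVerdict = evaluatePatch("diff --git ...", greenWorkspace);
const redVerdict = evaluatePatch("diff --git ...", redWorkspace);
```

Nothing in the verdict depends on how plausible the patch looks or who wrote it. The only input is what the tests observed.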

This has changed how I think about my own review process.

The ideal division of labor looks like this:

the model proposes
the tests judge
I arbitrate intent, tradeoffs, and product meaning

This approach keeps the human in the loop without forcing the human to be the loop.

# Does cheap generation change the calculus?

But we know this does not work, and the reason is structural.

# My Practical Operating Model

The philosophy only matters if it cashes out into an actual working loop.

This is the loop I keep returning to.

When I touch pure logic, I run npm test first because that is the cheapest place for reality to interrupt me.

When I touch component behavior, I still start there, because I want interaction and semantics to fail in the fast loop before I buy browser runtime.

When I touch a high-value flow, I run npm run test:e2e because that is where composition stops being theoretical.

That is not discipline for its own sake. It is me spending verification money where it buys the largest reduction in future human doubt.

# The Short Version

If I strip away all the tool names, categories, and tasteful theory links, my testing philosophy is embarrassingly simple:

cheap evaluators should carry most of the load
expensive evaluators should buy reality, not ego
tests should reduce the amount of code I must personally distrust
the agentic era makes verification, not generation, the scarce resource

The agentic era did not make the economic question cheaper. It made getting the answer wrong more expensive.

That is why I care about testing.

Not because it makes me virtuous. Not because it makes the repo look grown up. Not because a dashboard somewhere can render a coverage donut in brand colors.

I care because testing is how I turn code, especially AI-assisted code, into something I can change without negotiating with fear every single time.

# Further Reading

Practical Test Pyramid
Testing Trophy and testing classifications
Testing Library: Guiding principles
Google Testing on the Toilet: Test behavior, not implementation
Google Testing on the Toilet: Don't overuse mocks
Vitest: Coverage guide
Playwright: Locators, Trace viewer, Retries, Reporters
Sonar: Verification gap in AI coding
SWE-bench: Evaluation guide and Harness reference
OpenAI: Introducing SWE-bench Verified
Anthropic: Building effective agents
MSR 2026: Are Coding Agents Generating Over-Mocked Tests? and Testing with AI Agents
