Building a Self-Verifying Test Loop with Playwright and AI Agents

I recently wired up an agentic testing workflow using Playwright on my Branden Builds website. The inspiration came from a larger client project, and I wanted to bring it to my personal site to show it off. This is a technical deep dive into creating agents that run lint, type checks, unit tests, and e2e tests, then check their own work and fix their own mistakes, without me sitting in the feedback loop.

The Problem

Right now, most people are the feedback loop, including myself. Agents are fast at writing code but not great at knowing if their output is good. I wanted to create an engineering loop that prompts itself toward a goal and keeps going until it can prove it's done, not just until it feels done. For testing, "prove it" means a green suite the agent produced and verified on its own.

A Quick Reminder on Instructions

I wrote a post that quickly defines the various instruction files: skills, rules, and hooks. So I won't dive into those too much. Just a quick reminder:

  1. Rules (CLAUDE.md) - advisory context, loaded when a new session starts.
  2. Skills - context loaded on demand, only if the agent invokes it.
  3. Hooks - enforced on every tool call.

Everything below is built out of those three tiers, with one more on top: the agents that orchestrate them. The layering is important here because it keeps things as time- and cost-efficient as possible. Rules stay tiny so they don't tax every session, the heavy knowledge lives in a skill that's pulled in only when a spec is being written, and the one thing I refuse to leave to chance, failure triage, is a hook.

Defining “done” for the agent

An agent will happily tell you it's finished. The trick is making "finished" mean something it can't fake. In CLAUDE.md, done is a gate: a list of commands that must exit zero, in the order below.

## Definition of done

A change is not done until these exit zero, in this order:

1. `npm run lint` — Prettier check + ESLint with `--max-warnings 0`. Warnings are errors.
2. `npm run check` — `svelte-check` with strict TypeScript.
3. `npm run test:unit:run` — Vitest single run.
4. `npm run test:e2e` — Playwright. CI-only, and gated.

No escape hatches: no `eslint-disable`, no rule downgrades, no `@ts-expect-error`, no `any`. Fix the code.

The ordering is deliberate. Do the cheap checks first, so the agent fails fast on a formatting nit before it ever boots a browser. The "no escape hatches" line is the part that keeps it honest: without it, a stuck agent will reach for eslint-disable or as any to make the gate go green, which is exactly the "feels done" failure mode I'm trying to design out. Not until it feels done. Until it can prove it.
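The fail-fast ordering can be sketched as a tiny runner that executes each command in turn and stops at the first non-zero exit. This is a minimal sketch, not this repo's tooling: `GateStep`, `shellRun`, and `runGate` are hypothetical names, and only the four npm scripts come from the definition of done above.

```typescript
import { spawnSync } from "node:child_process";

// One step in the gate: a label and the shell command to run.
type GateStep = { name: string; command: string };

// The four commands from the definition of done, cheapest first.
const GATE: GateStep[] = [
	{ name: "lint", command: "npm run lint" },
	{ name: "check", command: "npm run check" },
	{ name: "unit", command: "npm run test:unit:run" },
	{ name: "e2e", command: "npm run test:e2e" }
];

// Default runner (hypothetical): runs the command in a shell and returns its exit code.
// A null status (process killed by a signal) is treated as a failure.
function shellRun(command: string): number {
	const result = spawnSync(command, { shell: true, stdio: "inherit" });
	return result.status ?? 1;
}

// Runs steps in order and stops at the first non-zero exit code.
// Returns the name of the failing step, or null when every step passed.
function runGate(steps: GateStep[], run: (command: string) => number = shellRun): string | null {
	for (const step of steps) {
		if (run(step.command) !== 0) return step.name;
	}
	return null;
}
```

Calling `runGate(GATE)` would run the real commands. Passing a fake `run` function makes the ordering itself easy to check without booting anything.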

Setting up Playwright

First, become familiar with the Playwright locators. As the docs state, we should locate elements using getByRole() before anything else. This is probably the most important thing when writing tests, so I wanted to call it out specifically.

Two tools worth knowing when you write tests by hand

New to Playwright? I suggest learning the tool and getting your hands dirty a bit first. Prevent the AI slop by understanding what it’s creating.

  • UI Mode: a visual test runner that plays your test back step by step. Time to ditch the console.log() prehistoric triage method. Run npx playwright test --ui and you can watch every action as a step on the timeline, inspect the DOM at each one, and see exactly where an assertion gave out.
  • Codegen: npx playwright codegen http://localhost:4173 opens a browser, records what you click and type, and turns it into test code. It makes mistakes, so review the output, but it's great for generating a skeleton you then tighten up.

The Config

Everything the automated side does starts from playwright.config.ts:

import { defineConfig, devices } from '@playwright/test';

export default defineConfig({
	testDir: 'tests',
	testMatch: '**/*.e2e.{ts,js}',
	fullyParallel: true,
	forbidOnly: !!process.env.CI,
	retries: process.env.CI ? 1 : 0,
	reporter: [
		['html', { outputFolder: 'playwright-report', open: 'never' }],
		['json', { outputFile: 'playwright-report/results.json' }],
		['list']
	],
	use: {
		baseURL: 'http://localhost:4173',
		trace: 'retain-on-failure',
		screenshot: 'only-on-failure',
		video: 'retain-on-failure'
	},
	projects: [{ name: 'chromium', use: { ...devices['Desktop Chrome'] } }],
	webServer: {
		command: 'npm run build && wrangler dev --port 4173 --ip 127.0.0.1',
		url: 'http://localhost:4173',
		reuseExistingServer: !process.env.CI,
		timeout: 120_000
	}
});

Notes on the config

  • The testMatch suffix was changed for my use case. I didn’t want it to conflict with my Vitest unit tests. Forget the .e2e.ts suffix and Playwright won’t see my files.
  • webServer builds the real thing. This isn't pointed at vite dev. It runs npm run build and serves the result through wrangler dev, the actual Cloudflare Workers runtime. Tests run against what ships, not against a dev-server approximation of it.
  • Artifacts only show up on failure. retain-on-failure / only-on-failure for trace, screenshot, and video.
  • Retries are CI-only. Zero retries locally (a flaky test should fail loudly in front of you), one retry in CI, because real infra has real flakiness, and the dossier hook below distinguishes "failed" from "flaky, recovered on retry" rather than lumping them together.
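With one retry in CI, "failed then passed" and "failed twice" are different outcomes. Here is a minimal sketch of that distinction, assuming you have the list of attempt statuses for one test. `classifyAttempts` and `Outcome` are hypothetical names; the attempt statuses mirror the ones Playwright reports for a single test run.

```typescript
// Status of a single attempt, as Playwright reports it per run of a test.
type AttemptStatus = "passed" | "failed" | "timedOut" | "skipped" | "interrupted";

// Hypothetical summary label for the whole test across its attempts.
type Outcome = "passed" | "flaky" | "failed" | "skipped";

function classifyAttempts(attempts: AttemptStatus[]): Outcome {
	if (attempts.length === 0 || attempts.every((a) => a === "skipped")) return "skipped";

	// The final attempt decides pass vs. fail.
	const last = attempts[attempts.length - 1];
	if (last !== "passed") return "failed";

	// Passed in the end, but only after an earlier attempt went wrong.
	const earlierFailure = attempts.slice(0, -1).some((a) => a !== "passed");
	return earlierFailure ? "flaky" : "passed";
}
```

Keeping "flaky" as its own bucket means a recovered test still shows up in triage instead of hiding behind a green run.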

The NPM scripts that drive it

"test:unit": "vitest",
"test:unit:run": "vitest --run",
"test": "npm run test:unit -- --run && npm run test:e2e",
"test:e2e": "playwright install && playwright test",
"test:e2e:ci": "playwright test",
"test:e2e:ui": "playwright test --ui",
"test:e2e:report": "playwright show-report",
"test:e2e:codegen": "playwright codegen http://localhost:4173"

Two flavors of unit test, one Vite config

Scaffolding a Svelte project with npx sv create my-app sets up a config for component and unit tests. E2E is the third tier. Before it come two other test tiers that share a single Vite config from the SvelteKit scaffold, which splits Vitest into two test projects:

test: {
		expect: { requireAssertions: true },
		projects: [
			{
				extends: './vite.config.ts',
				test: {
					name: 'client',
					browser: {
						enabled: true,
						provider: playwright(),
						instances: [{ browser: 'chromium', headless: true }]
					},
					include: ['src/**/*.svelte.{test,spec}.{js,ts}'],
					exclude: ['src/lib/server/**']
				}
			},

			{
				extends: './vite.config.ts',
				test: {
					name: 'server',
					environment: 'node',
					include: ['src/**/*.{test,spec}.{js,ts}'],
					exclude: ['src/**/*.svelte.{test,spec}.{js,ts}']
				}
			}
		]
	}

client runs Svelte component specs in a real headless Chromium via @vitest/browser-playwright. That's right: Playwright is doing double duty here, once as Vitest's browser provider for component tests and once as the actual e2e runner. server runs everything else in plain Node. E2E is deliberately a third tier, outside Vitest entirely. It’s a different runner, a different config, and a different failure mode (real browser, real network, real build).
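To make the split concrete, here is a rough sketch of which tier picks up a given file. The regexes only approximate the include/exclude globs from the two configs and are not how Vitest or Playwright actually match files; `tierFor` and `Tier` are hypothetical names.

```typescript
type Tier = "client" | "server" | "e2e" | "none";

// Approximations of the globs in vite.config.ts and playwright.config.ts.
const SVELTE_SPEC = /^src\/.*\.svelte\.(test|spec)\.(js|ts)$/;
const ANY_SPEC = /^src\/.*\.(test|spec)\.(js|ts)$/;
const E2E_SPEC = /^tests\/.*\.e2e\.(ts|js)$/;

function tierFor(path: string): Tier {
	if (E2E_SPEC.test(path)) return "e2e";
	if (SVELTE_SPEC.test(path)) {
		// The client project excludes src/lib/server/**, and the server project
		// excludes *.svelte.{test,spec}.*, so such a file would run in neither.
		return path.startsWith("src/lib/server/") ? "none" : "client";
	}
	if (ANY_SPEC.test(path)) return "server";
	return "none";
}
```

The useful property is that no file lands in two tiers: a `.svelte.test.ts` file is a browser test, any other `.test.ts` under src is a Node test, and only `.e2e.ts` files under tests reach Playwright.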

Where each tier runs

E2E does not run on every PR. It runs on push to main, or on a PR explicitly labeled run-e2e — it's expensive to run. Everything else (lint, type-check, unit tests) runs on every PR. On failure, the job uploads playwright-report/ (HTML report, JSON results, traces, screenshots, videos) as a 7-day artifact.
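The trigger rule is small enough to write down as a predicate. This is only a sketch of the decision: `CiEvent` and `shouldRunE2E` are hypothetical names, and in practice the same condition would live in the CI workflow config rather than in TypeScript.

```typescript
// Hypothetical, simplified view of the CI event that triggered the run.
type CiEvent =
	| { kind: "push"; branch: string }
	| { kind: "pull_request"; labels: string[] };

// E2E runs on pushes to main, or on PRs that carry the run-e2e label.
function shouldRunE2E(event: CiEvent): boolean {
	if (event.kind === "push") return event.branch === "main";
	return event.labels.includes("run-e2e");
}
```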

The same logic shapes the git hooks, where e2e is purposely absent:

# .husky/pre-commit
npx lint-staged

# .husky/pre-push
npm run check
npm run test:unit:run

Pre-commit lints and formats staged files. Pre-push type-checks and runs unit tests. Neither runs e2e. E2E needs a full build and a real server boot; running that on every git push would make local development miserable. So the trade-off I make is fast feedback locally and slow-but-thorough feedback in CI. NOTE: I can also trigger e2e on a PR that isn't merging into main by adding the run-e2e label.

Setting up AI Guardrails

A config decides where tests run. It doesn't decide whether they're any good. For that, the repo enforces its conventions with eslint-plugin-playwright plus two custom rules, scoped to tests/**/*.e2e.ts. They aren't suggestions in a doc somewhere; they're errors that block a commit:

{
	files: ['tests/**/*.e2e.ts'],
	...playwright.configs['flat/recommended'],
	rules: {
		...playwright.configs['flat/recommended'].rules,
		'playwright/no-wait-for-timeout': 'error',
		'playwright/no-networkidle': 'error',
		'playwright/no-raw-locators': 'error',
		'playwright/no-element-handle': 'error',
		'playwright/no-force-option': 'error',
		'playwright/no-page-pause': 'error',
		'playwright/prefer-web-first-assertions': 'error',
		'playwright/valid-expect': 'error',
		'no-restricted-syntax': [
			'error',
			{
				selector: "CallExpression[callee.property.name='all']:not([callee.object.name='Promise'])",
				message:
					'locator.all() has no auto-waiting — use `await expect(locator).toHaveCount(n)` (it retries), then `locator.evaluateAll(...)` if you need the elements.'
			},
			{
				selector: "CallExpression[callee.property.name='toPass'][arguments.length=0]",
				message:
					'toPass() defaults to timeout 0, which ignores the expect timeout and keeps retrying until the test times out. Pass an explicit timeout, e.g. `.toPass({ timeout: 5000 })`.'
			}
		]
	}
}

The two custom rules came from real mistakes, not theory. locator.all() resolves immediately with no auto-waiting, so it's a built-in source of flakiness dressed up as a convenience method. toPass() with no arguments defaults to a timeout of 0, which does not respect the configured expect timeout; a timeout of 0 means no matcher deadline, so a block that never passes keeps retrying until the test timeout kills it. It looks like a bounded retry helper and isn't one unless you pass a timeout. Both got banned with a message written as an instruction, not a complaint. When the agent (or I) hit the lint error, the message says what to do instead, not just what's wrong. That's a small thing, but it means the linter is teaching, not just blocking.

Not everything fits in a lint rule, so the rest is written into every agent's instructions:

  • Locator order is fixed: getByRole → getByLabel → getByPlaceholder → getByText → getByTestId → never a raw CSS selector. playwright/no-raw-locators enforces the "never" part.
  • No waitForTimeout, no waitForLoadState('networkidle'). Both are banned outright. Timing is handled exclusively by auto-retrying expect(locator) assertions. await expect(x).toBeVisible() polls until it's true or times out, which is strictly more reliable than guessing a sleep duration.
  • Never edit an assertion to match broken UI. This is bolded because this is what I see AI do ALL.THE.TIME. If the test is right and the app is wrong, the fix is test.fixme() with a comment explaining the real bug and not loosening the assertion until it passes. A green suite that's lying to you is worse than a red one that isn't.
  • One purpose per support directory. tests/data/ holds seed JSON, tests/fixtures/ holds Playwright fixtures and recorded artifacts, tests/helpers/ holds shared TypeScript helpers. If two specs need the same literal or logic, it gets pulled into one of these and imported — never copy-pasted across files.
  • The SSR caveat. page.route only intercepts client-side requests. This site's blog content comes from Storyblok via a server-side load function, which page.route simply never sees. So specs assert structure and navigation: "does the heading render, does the link go where it says." They never assert specific CMS copy, because that copy can change without a code change and would make the test fail for the wrong reason.

Turning traces into text for diagnostic failures

This is the first place the cost-of-failure problem gets solved directly, and the one piece I'd call genuinely novel rather than standard Playwright hygiene. Every spec in this repo imports test/expect from a custom fixture instead of @playwright/test:

import { test as base, expect } from '@playwright/test';

export const test = base.extend({
	page: async ({ page }, use, testInfo) => {
		const consoleMessages: string[] = [];
		const networkIssues: string[] = [];

		page.on('console', (msg) => {
			const type = msg.type();
			if (type === 'error' || type === 'warning') {
				consoleMessages.push(`[${type}] ${msg.text()}`);
			}
		});
		page.on('pageerror', (err) => {
			consoleMessages.push(`[pageerror] ${err.message}`);
		});
		page.on('requestfailed', (req) => {
			const reason = req.failure()?.errorText ?? 'unknown';
			networkIssues.push(`[requestfailed] ${req.method()} ${req.url()} — ${reason}`);
		});
		page.on('response', (res) => {
			if (res.status() >= 400) {
				networkIssues.push(`[${res.status()}] ${res.request().method()} ${res.url()}`);
			}
		});

		await use(page);

		if (testInfo.status !== testInfo.expectedStatus) {
			if (consoleMessages.length > 0) {
				await testInfo.attach('console', {
					body: consoleMessages.join('\n'),
					contentType: 'text/plain'
				});
			}
			if (networkIssues.length > 0) {
				await testInfo.attach('network', {
					body: networkIssues.join('\n'),
					contentType: 'text/plain'
				});
			}
		}
	}
});

export { expect };

Here's the problem this solves. A Playwright trace is a fantastic tool for a human clicking through a timeline in show-trace. It is a terrible tool for an agent, because reading one means unzipping a binary, walking a proprietary format, and burning a pile of tokens to extract one TypeError. This fixture sidesteps that entirely: it listens for console errors and warnings, uncaught pageerrors, failed requests, and any 4xx/5xx response, and on failure only, attaches them as plain-text attachments named console and network. Forwarding goes through testInfo.attach(), never console.*, so it doesn't trip the repo's own no-console lint rule on its own diagnostic code (yes, it would have, and yes, that's a fun bug to find).
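The filtering and the "attach only on failure" rule are plain logic, so they can be pulled out of the fixture and checked without a browser. A minimal sketch under that assumption: `DiagnosticsBuffer`, `Attachment`, and the method names are hypothetical and not part of this repo or of Playwright.

```typescript
type Attachment = { name: "console" | "network"; body: string };

class DiagnosticsBuffer {
	private consoleMessages: string[] = [];
	private networkIssues: string[] = [];

	// Keep only errors and warnings; log/info/debug noise is dropped.
	onConsole(type: string, text: string): void {
		if (type === "error" || type === "warning") {
			this.consoleMessages.push(`[${type}] ${text}`);
		}
	}

	// Uncaught page errors are always recorded.
	onPageError(message: string): void {
		this.consoleMessages.push(`[pageerror] ${message}`);
	}

	// Any 4xx/5xx response is recorded with its method and URL.
	onResponse(status: number, method: string, url: string): void {
		if (status >= 400) this.networkIssues.push(`[${status}] ${method} ${url}`);
	}

	// Returns nothing when the test ended with its expected status,
	// so passing tests never produce attachments.
	attachments(status: string, expectedStatus: string): Attachment[] {
		if (status === expectedStatus) return [];
		const out: Attachment[] = [];
		if (this.consoleMessages.length > 0) {
			out.push({ name: "console", body: this.consoleMessages.join("\n") });
		}
		if (this.networkIssues.length > 0) {
			out.push({ name: "network", body: this.networkIssues.join("\n") });
		}
		return out;
	}
}
```

In a fixture, the page event handlers would forward into the buffer and each returned attachment would go to testInfo.attach().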

Writing Real Specs

With the config, the rules, and the fixture in place, the specs themselves stay short. Three real ones, three different shapes of test.

tests/home.e2e.ts — note the import: every spec pulls test/expect from the diagnostics fixture, not @playwright/test:

import { test, expect } from './fixtures/diagnostics';

test.describe('Home page', () => {
	test('primary navigation is visible', async ({ page }) => {
		await page.goto('/');
		await expect(page.getByRole('navigation')).toBeVisible();
	});

	test('page has a heading', async ({ page }) => {
		await page.goto('/');
		await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
	});

	test('/blog page loads', async ({ page }) => {
		await page.goto('/blog');
		await expect(page).toHaveURL('/blog');
		await expect(page.getByRole('heading', { level: 1 })).toBeVisible();
	});
});

The dossier hook: for agents to glance at

The diagnostics fixture makes the right information exist as plain text. The dossier hook is what puts it in front of the agent without it asking. This is the piece that turns "agent reads a trace" into "agent reads a markdown summary." It's a PostToolUse hook in .claude/settings.json, wired to fire after every Bash tool call:

"hooks": {
	"PostToolUse": [
		{
			"matcher": "Bash",
			"hooks": [
				{ "type": "command", "command": ".claude/hooks/playwright-dossier.sh", "statusMessage": "Building failure dossier..." }
			]
		}
	]
}

The script (.claude/hooks/playwright-dossier.sh) stays silent unless three things are all true: the Bash command actually contained playwright test, playwright-report/results.json reports failures, and that results.json is fresher than the last one it processed (tracked via a .dossier-seen marker file at the project root). That last check exists because Playwright wipes test-results/ and regenerates playwright-report/ on every run, so the dossier is always built from the latest run and never from a stale one.
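The three conditions boil down to one predicate. The real hook is a shell script whose internals aren't shown here, so this is only a sketch of the decision: `HookInput` and `shouldEmitDossier` are hypothetical names, and comparing modification times against the marker file is an assumption about how "fresher" is measured.

```typescript
// Hypothetical inputs the hook would gather before deciding to speak up.
type HookInput = {
	command: string; // the Bash command the agent just ran
	failureCount: number; // failures reported in playwright-report/results.json
	reportMtimeMs: number; // modification time of results.json
	lastSeenMtimeMs: number; // modification time of .dossier-seen, 0 if it does not exist
};

function shouldEmitDossier(input: HookInput): boolean {
	const ranPlaywright = input.command.includes("playwright test");
	const hasFailures = input.failureCount > 0;
	// Assumption: "fresher" means results.json changed after the marker was last touched.
	const isFresh = input.reportMtimeMs > input.lastSeenMtimeMs;
	return ranPlaywright && hasFailures && isFresh;
}
```

Because the hook fires after every Bash call, the silent path is the common one. Only a fresh, failing Playwright run produces output.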

When it does fire, it parses the JSON reporter output with jq and emits a dossier like this:

# Playwright failure dossier — 1 failing

## opens on nav link click
- spec: tests/contact-modal.e2e.ts:12
- project: chromium
- repro: npx playwright test --project=chromium tests/contact-modal.e2e.ts -g "opens on nav link click"
- screenshot: test-results/contact-modal-opens-chromium/test-failed-1.png
- trace: npx playwright show-trace test-results/contact-modal-opens-chromium/trace.zip

error:
Error: expect(locator).toBeVisible() failed
Locator: getByRole('dialog', { name: 'Get in touch' })
Expected: visible
Received: <element(s) not found>

console:
[pageerror] Cannot read properties of undefined (reading 'open')
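The real hook builds this with jq in a shell script. To show the shape of the transformation, here is a TypeScript sketch that renders one failure section from an already-extracted record. `Failure` and `renderFailure` are hypothetical names, the screenshot and trace lines are left out for brevity, and the record is assumed to have been pulled out of results.json beforehand.

```typescript
// Hypothetical, already-flattened view of one failing test.
type Failure = {
	title: string;
	file: string;
	line: number;
	project: string;
	error: string;
	console?: string; // contents of the "console" attachment, if any
};

function renderFailure(f: Failure): string {
	const lines = [
		`## ${f.title}`,
		`- spec: ${f.file}:${f.line}`,
		`- project: ${f.project}`,
		// Copy-pasteable repro. Assumes the title contains no double quotes.
		`- repro: npx playwright test --project=${f.project} ${f.file} -g "${f.title}"`,
		"",
		"error:",
		f.error
	];
	if (f.console) lines.push("", "console:", f.console);
	return lines.join("\n");
}
```

The point of the format is that the agent gets the error, the console output, and a one-line repro command in a few hundred tokens, without opening the trace.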

The AI Overlords Agents

Alright, we went deep on scaffolding and guardrails, and we defined what DONE means. The agents are what actually drive the loop. There are three in .claude/agents/, each scoped to one phase of the test lifecycle.

playwright-test-planner reads src/routes/<path>/+page.svelte and its components, then writes a numbered scenario plan to tests/plans/<feature>.md. It never writes a spec or runs anything. Its plan format is deliberately implementation-agnostic:

### 1. <Scenario name>

- **Precondition:** <page state before the test>
- **Locator intent:** <what role/label/text identifies the element — not a CSS selector>
- **Action:** <what the user does>
- **Expected outcome:** <what should be true afterward>
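One way to read that format is as a small record with four fields. The sketch below types it and renders it back into the same markdown shape. `Scenario` and `renderScenario` are hypothetical names, not something the planner agent actually uses.

```typescript
// The four fields of a plan entry, plus its name.
type Scenario = {
	name: string;
	precondition: string;
	locatorIntent: string; // role/label/text description, never a CSS selector
	action: string;
	expectedOutcome: string;
};

// Renders one numbered scenario in the plan's markdown format.
function renderScenario(index: number, s: Scenario): string {
	return [
		`### ${index}. ${s.name}`,
		"",
		`- **Precondition:** ${s.precondition}`,
		`- **Locator intent:** ${s.locatorIntent}`,
		`- **Action:** ${s.action}`,
		`- **Expected outcome:** ${s.expectedOutcome}`
	].join("\n");
}
```

Nothing in the record names a selector or a file, which is what lets the generator decide the concrete locator from the real route source.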

playwright-test-generator picks one scenario from a plan, reads the real route source, and writes tests/<scenario>.e2e.ts. Then it runs npx playwright test tests/<scenario>.e2e.ts --reporter=line and iterates until green, never editing an assertion to paper over a bug. Once green, it runs npm run lint && npm run check before calling it done.

playwright-test-healer is invoked when a previously-passing spec breaks. It runs npx playwright test --last-failed, but before proposing any fix it reads the dossier's console/network attachments first (cheaper than the trace), then the accessibility-tree dump Playwright prints on a failed role query. If the role/name is genuinely gone from the DOM, the fix is to repair the component's accessible name and not to fall back to a getByTestId or CSS escape hatch. If the behavior itself is broken, it stops and marks test.fixme() with a bug note instead of forcing green.

One skill, loaded on demand

The agents don't carry their own copy of "how to write a good locator." They all point at one shared skill, .claude/skills/playwright-cli/SKILL.md (modified from the Playwright CLI skills), loaded only when one of them is actually about to touch a spec:

## Non-negotiables (enforced by eslint-plugin-playwright)

1. **Locators** — use `getByRole` first, always. `locator(css)` is banned.
2. **Waiting** — never `waitForTimeout` or `waitForLoadState('networkidle')`.
3. **webServer** — owns startup only; never embed seed/migration. Explicit `baseURL`.
4. **Reporting + traces** — HTML + JSON + list reporters; trace/screenshot/video on failure.
5. **CLI over MCP** — use `npx playwright …` only.

This is the three-tier model from the top of the post, now doing real work. Rules (CLAUDE.md) stay short and general. Skills hold the deep, topic-specific knowledge, such as locator order, waiting semantics, reporting setup, and a CLI cheatsheet, and they get pulled into context only when an agent is actually about to write or fix a spec, instead of bloating every single session. Hooks, like the dossier script, are the one tier that's enforced rather than advisory. The agents are the orchestration layer on top of all three: planner and generator both explicitly say "follow the playwright-cli skill," and the healer leans on the hook's dossier before it leans on anything else.

Putting it together

If you made it this far, you might be asking: okay cool, that's a lot of setup, but how does it actually work? Here's the whole loop, start to finish, using something that's actually missing from this repo right now, my contact modal coverage.

Step 1: plan it. I tell the planner: "write a test plan for the contact modal." It reads the relevant component and writes tests/plans/contact-modal.md.

Step 2: generate it. I point the generator at scenario 1. It reads the plan, reads the modal component for the real role/name, and writes tests/contact-modal.e2e.ts.

While running, it found an obscure accessibility bug! This is why it sometimes works better to grab pre-made components: accessibility and modals are difficult. I'll have to dive into this, but it proves my test healer is working as expected too.

// BUG: createHashDialog's trigger-tracking click listener (src/lib/state/hashDialog.svelte.ts)
// is registered inside a `$effect`, which mounts well after the page becomes
// interactive/clickable. Clicking the header "contact" link before that listener registers
// means `triggerEl` is never set, so `handleClose`'s `trigger?.focus()` is a silent no-op and
// focus falls back to `<body>` instead of returning to the trigger. Repro showed the
// click-vs-listener race lands the click ~400ms before the capture listener registers
// (~470ms) on a fresh page load — reproduced in ~80% of runs locally. Fix belongs in the app
// (register the trigger-tracking listener earlier, e.g. outside the `$effect`, or track the
// trigger via the click handler bound directly on each `a[href="#contact-modal"]` link rather
// than a document-level listener that races hydration).
test.fixme('2b. closing the modal via the Close button returns focus to the trigger', async ({ page }) => {
	await page.goto('/');

	const headerNav = page.getByRole('navigation').first();
	const trigger = headerNav.getByRole('link', { name: 'contact' });
	await trigger.click();

	const dialog = page.getByRole('dialog');
	await expect(dialog).toBeVisible();

	await dialog.getByRole('button', { name: 'Close' }).click();

	await expect(dialog).toBeHidden();
	await expect(trigger).toBeFocused();
});

So there it is: in about 20 minutes we have a full e2e test suite on my modal that even found an accessibility bug. It even pointed me toward a possible fix!

Diving deeper into Playwright

My website is pretty basic: it's just a headless CMS talking to one API (I guess two, technically, with the form, but that's just a client-side public call). However, you can dive much deeper into Playwright. It's a very powerful tool, and here are some notes on how I've used it in more complicated codebases.

  1. Saving auth storage state to a JSON file for reuse.
  2. Fixtures - creating specific users (based on the auth user JSON above), state (empty cart, cart items, etc.), and so much more.
  3. Using HARs to record third-party API routes as JSON so you can easily build your own mocks.
  4. Visual regressions with screenshots.
  5. Accessibility scans with Axe and Playwright.

Conclusion

In my opinion, Playwright with AI might be one of the most powerful tools for creating user interfaces. With verification loops, we can catch far more of what's broken before it ships. I just scratched the surface in this post, so enjoy diving in deeper. I highly recommend the master.dev course by Steve Kinney. I learned a ton from it and I'm now applying it across several projects.
