Claude Code Subagents: Our Production Workflow for Ship-Safe Deploys
We ship headless sites for clients who measure everything -- Core Web Vitals, schema markup, accessibility scores, organic traffic deltas. One bad deploy can tank rankings that took months to build. So when Anthropic rolled out subagents, hooks, and skills in Claude Code, we rebuilt our entire pre-deploy pipeline around them.
This post walks through our exact setup: the .claude/ directory structure, each subagent definition, the hook configs, and the skill files that tie it all together. We'll share four real incidents the system caught, the ROI math at our scale, and where this approach still has gaps.
We hit a 75% CLS spike that made it to production anyway
Three weeks back. Client's blog redesign. Everything looked fine in dev. We merged.
Two days later their CTO sends a screenshot from Search Console. CLS jumped from 0.08 to 0.14 on mobile. Pages that ranked #3 for "enterprise billing software" dropped to #8. Revenue impact? They estimated $40k/month.
The problem? A hero image that loaded async but had no width or height attributes. Classic. Our CI caught nothing because we weren't checking layout shift on the actual preview build.
That's when we started looking at subagents.
Subagents are scoped Claude Code instances that run inside a parent session with their own system prompt, tool access, and task boundary. Hooks trigger subagents at specific points -- before a command runs, after file changes, or on commit. Skills are reusable instruction files (markdown) that teach Claude how to perform a specific task.
Anthropic shipped the redesign on April 14, 2025, introducing Routines alongside these primitives. For our use case, raw subagents plus hooks gave us finer control over exactly when each check fires and what context it receives.
The key difference from traditional CI checks: subagents can reason about results, correlate failures across checks, and write human-readable summaries. A CI job returns exit code 0 or 1. A subagent returns "The structured data on /blog/[slug] is missing the dateModified field, which was present in the previous build. This will likely cause a rich snippet regression in Google Search Console within 3-5 days."
That's the whole point.
Three months in, our GitHub Actions setup finally broke us
Our previous pipeline was a tangle of GitHub Actions calling Lighthouse CI, pa11y, linkinator, and custom Node scripts. It worked.
Sort of.
But it had three problems.
No reasoning between checks. If Lighthouse flagged a CLS issue and the accessibility scan flagged a missing alt attribute on the same image, we got two separate alerts with no connection. Engineers had to manually grep through CI logs, correlate timestamps, and figure out it was the same component.
Waste of time.
Brittle config. Each tool had its own config file, threshold format, and output schema. Updating thresholds meant touching 4-6 files. YAML here. JSON there. Environment variables in a third place. One typo and the whole pipeline exits 0 when it should fail.
No contextual explanations. Engineers got pass/fail. Junior devs spent 20-40 minutes understanding why something failed and what to do about it. "Accessibility score: 87" doesn't tell you which ARIA attribute is missing or why it matters for screen readers.
We'd spend 3 hours a week debugging false positives or explaining failures in Slack.
The final straw? August 2025. We pushed a Northwind Traders redesign at 4pm on a Friday (I know). Lighthouse passed. Accessibility passed. Links passed. We shipped.
Monday morning their VP of Marketing emails us. "Why are our product pages missing from Google?" Turns out we'd accidentally set robots meta to noindex on every page under /products/. Our CI didn't check robots tags. Took six days to get re-indexed. They lost an estimated $12k in revenue.
We didn't need another CI tool -- we needed an orchestration layer that could reason about the outputs of tools we already trusted. That reasoning is only as good as the instructions behind it: well-written skill files are the difference between a subagent that hallucinates accessibility rules and one that runs pa11y with the right flags and interprets the JSON output correctly.
Our .claude/ directory structure
Here's the actual tree:
.claude/
├── settings.json
├── agents/
│ ├── seo-regression.md
│ ├── cwv-smoke.md
│ ├── accessibility.md
│ ├── broken-links.md
│ ├── schema-validation.md
│ └── deploy-gate.md
├── skills/
│ ├── run-lighthouse.md
│ ├── run-pa11y.md
│ ├── run-linkinator.md
│ ├── parse-schema-org.md
│ ├── compare-seo-snapshot.md
│ └── format-deploy-report.md
└── snapshots/
└── seo-baseline.json
The snapshots/ directory holds baseline data for comparison checks. Simple. We version it in Git so we can see what changed when a client asks "why did rankings drop last Tuesday?"
Nothing fancy. Just markdown files and JSON.
A client called at 11pm because Google dropped all their rich snippets
September 2025. We're building an e-commerce site for a mid-sized retailer (let's call them Acme Home Goods). They'd spent six months getting rich snippets -- product stars, pricing, availability -- showing up in search results.
We push a Shopify theme update. Looks fine. Ships Friday night.
Saturday at 11:14pm I get a text. "Our product pages look broken in Google. Stars are gone. Prices are gone. What happened?"
I open Search Console. Every single product page is throwing structured data errors. The offers field is missing priceCurrency. Without it, Google won't show the rich snippet. Rankings didn't drop, but click-through rate went from 4.2% to 1.8% overnight.
Cost? About $8k/week in lost traffic until we fixed it and Google re-crawled everything.
The schema was there. We just changed the property name from priceCurrency to currency because the Shopify API uses that key. Didn't think about it. No validation caught it.
That's when we built the schema-validation subagent.
You create a markdown file in .claude/agents/ with a system prompt, a list of allowed tools, and task instructions. The parent session (or a hook) spawns it with dispatch_agent() or via the hook config in settings.json.
Minimal structure:
# Agent: [Name]
## Role
[One-line description]
## Allowed Tools
- Bash (restricted to specific commands)
- Read file
- Write file
## Instructions
[Step-by-step task description, referencing skill files]
## Output Format
[Exact format the parent expects]
Be extremely specific about output format. If the deploy-gate orchestrator expects JSON with a passed boolean and a summary string, spell that out. Subagents that return free-form text break orchestration. We learned this the hard way when a subagent returned markdown tables and the deploy gate couldn't parse them. Took me two hours at 2am to debug because the parent just silently failed. No error. Just didn't trigger the deploy block.
Don't make my mistake. Lock down the format.
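Here's a minimal sketch of the kind of guard that would have turned that silent failure into a loud one. It assumes the contract described above (a JSON object with a passed boolean and a summary string). parseAgentOutput is a hypothetical helper name, not something Claude Code provides.

```javascript
// Hypothetical guard for the contract above: a JSON object with a boolean
// `passed` and a string `summary`. Anything else throws instead of being ignored.
function parseAgentOutput(raw) {
  let data;
  try {
    data = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Subagent output is not valid JSON: ${err.message}`);
  }
  if (data === null || typeof data !== "object" || Array.isArray(data)) {
    throw new Error("Subagent output must be a JSON object");
  }
  if (typeof data.passed !== "boolean") {
    throw new Error('Subagent output needs a boolean "passed" field');
  }
  if (typeof data.summary !== "string") {
    throw new Error('Subagent output needs a string "summary" field');
  }
  return data;
}

// A markdown table (the failure described above) is rejected with an error:
try {
  parseAgentOutput("| check | result |\n|---|---|\n| seo | pass |");
} catch (err) {
  console.error(err.message);
}
```

Whatever calls this should treat a thrown error as a reason to stop, not as a pass.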
Subagent 1: SEO regression check
This compares the current build's SEO-critical elements against a baseline snapshot.
# Agent: SEO Regression Check
## Role
Detect SEO regressions between the current build and the stored baseline.
## Allowed Tools
- Bash (node scripts only)
- Read file
## Instructions
1. Read the skill file at .claude/skills/compare-seo-snapshot.md
2. Run: node scripts/extract-seo-meta.js --url=$PREVIEW_URL --output=/tmp/seo-current.json
3. Read .claude/snapshots/seo-baseline.json
4. Compare the two snapshots field by field:
- title tags (exact match)
- meta descriptions (similarity > 0.85)
- canonical URLs (exact match)
- h1 count (must equal 1 per page)
- robots meta (must not have changed to noindex)
- Open Graph tags (og:title, og:description, og:image present)
5. Flag any page where robots changed to noindex as CRITICAL.
6. Flag missing or duplicate title tags as HIGH.
7. Flag meta description changes > 15% different as MEDIUM.
## Output Format
{"passed": boolean, "critical": [], "high": [], "medium": [], "summary": string}
The extract-seo-meta.js script is 120 lines of Puppeteer that hits every page in the sitemap and dumps title, meta, canonicals, h1s, and OG tags to JSON. Nothing smart. Just extraction.
The subagent's value is in the comparison and reasoning, not the extraction. It knows which changes matter. Which ones are cosmetic. Which ones will cost the client $15k in organic traffic next quarter.
Example: if you change a meta description from "Best CRM software for small businesses in 2025" to "Best CRM software for small business", the similarity score is 0.91. That's fine. But if it changes to "CRM software", similarity drops to 0.65, well under the 0.85 threshold. The subagent flags it as MEDIUM because the description lost nearly all of its qualifying terms ("best", "small businesses", "2025") and will probably hurt CTR.
It's not just diff. It's reasoning about what the diff means.
We've caught four issues with this so far. The robots noindex thing. A case where someone deleted all the OG images (would've tanked social shares). A case where title tags got truncated to 40 chars instead of 60 (just looked bad, didn't hurt SEO, but client would've noticed). And one where canonical URLs changed from https:// to http:// (would've risked duplicate-content and canonicalization problems).
Each one would've cost us at least a few hours of cleanup and client trust. Probably more.
We store seo-baseline.json in the repo and update it as part of the post-deploy-success hook.
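To make the comparison step concrete, here's a rough sketch of two of the rules: the robots noindex check and the 0.85 meta description threshold. The snapshot shape and the function names are assumptions. So is the similarity metric: this sketch uses plain token overlap (Jaccard), so it won't reproduce the 0.91 and 0.65 scores quoted above.

```javascript
// Token-overlap (Jaccard) similarity in [0, 1]. This metric is an assumption
// chosen for the sketch, not necessarily what the subagent computes.
function similarity(a, b) {
  const tokens = (s) => new Set(s.toLowerCase().match(/[a-z0-9]+/g) || []);
  const A = tokens(a);
  const B = tokens(b);
  if (A.size === 0 && B.size === 0) return 1;
  let shared = 0;
  for (const t of A) if (B.has(t)) shared += 1;
  return shared / (A.size + B.size - shared);
}

// Assumed snapshot shape: { [url]: { robots: string, description: string } }
function compareSnapshots(baseline, current) {
  const report = { passed: true, critical: [], high: [], medium: [], summary: "" };
  for (const [url, before] of Object.entries(baseline)) {
    const after = current[url];
    if (!after) continue; // removed pages are covered by a separate rule
    const wasNoindex = /noindex/i.test(before.robots || "");
    const isNoindex = /noindex/i.test(after.robots || "");
    if (!wasNoindex && isNoindex) {
      report.critical.push({ url, issue: "robots changed to noindex" });
    }
    const score = similarity(before.description || "", after.description || "");
    if (score < 0.85) {
      report.medium.push({ url, issue: `meta description similarity ${score.toFixed(2)}` });
    }
  }
  // Assumption: only CRITICAL and HIGH findings fail the check.
  report.passed = report.critical.length === 0 && report.high.length === 0;
  report.summary = `${report.critical.length} critical, ${report.high.length} high, ${report.medium.length} medium`;
  return report;
}
```

A deterministic pass like this only produces the raw flags. Judging which of them actually matter is still the subagent's job.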
Subagent 2: Core Web Vitals smoke test
# Agent: CWV Smoke Test
## Role
Run Lighthouse on key pages and flag CWV regressions.
## Allowed Tools
- Bash
## Instructions
1. Read .claude/skills/run-lighthouse.md
2. Run Lighthouse CI against $PREVIEW_URL for these pages:
- / (homepage)
- /blog/ (listing)
- /blog/[most-recent-post] (detail)
- /services/ (if exists)
3. Thresholds (fail if any is breached):
- LCP: at most 2500ms
- INP: at most 200ms
- CLS: at most 0.1
- Performance score: at least 85
- Accessibility score: at least 90
4. If a metric regressed by more than 10% from the previous run,
flag as WARNING even if it is still within its threshold.
5. Include the specific element causing LCP or CLS where Lighthouse reports it.
## Output Format
{"passed": boolean, "pages": [{"url": string, "scores": {}, "flags": []}], "summary": string}
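The threshold logic in steps 3 and 4 mixes two directions: timings and CLS fail when they go up, scores fail when they go down. Here's a small sketch of that evaluation. The metric keys and the evaluatePage name are made up for illustration.

```javascript
// Limits from the agent definition. "max" metrics fail above the limit,
// "min" metrics fail below it. Key names are illustrative.
const THRESHOLDS = {
  lcp: { limit: 2500, kind: "max" }, // ms
  inp: { limit: 200, kind: "max" }, // ms
  cls: { limit: 0.1, kind: "max" }, // unitless
  performance: { limit: 85, kind: "min" },
  accessibility: { limit: 90, kind: "min" },
};

function evaluatePage(current, previous = {}) {
  const flags = [];
  for (const [metric, { limit, kind }] of Object.entries(THRESHOLDS)) {
    const value = current[metric];
    if (typeof value !== "number") continue; // metric not reported for this page
    const failed = kind === "max" ? value > limit : value < limit;
    if (failed) {
      flags.push({ metric, level: "FAIL", value, limit });
      continue;
    }
    const prev = previous[metric];
    if (typeof prev !== "number" || prev === 0) continue;
    // A positive value means "got worse" for both kinds of metric.
    const worsening = kind === "max" ? (value - prev) / prev : (prev - value) / prev;
    if (worsening > 0.1) {
      flags.push({ metric, level: "WARNING", value, previous: prev });
    }
  }
  return { passed: !flags.some((f) => f.level === "FAIL"), flags };
}
```

For example, an LCP that moves from 1800ms to 2100ms is still under 2500ms, but it regressed by about 17%, so it comes back as a WARNING rather than a pass.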
The associated skill file (run-lighthouse.md) contains the exact lhci CLI invocation:
# Skill: Run Lighthouse
## Command
```bash
npx @lhci/cli@0.14.0 collect \
  --url="$1" \
  --numberOfRuns=3 \
  --settings.preset=desktop \
  --settings.output=json \
  --settings.outputPath=/tmp/lhci-results/
```
## Parsing
Read the median run from /tmp/lhci-results/. Extract:
- categories.performance.score * 100
- audits['largest-contentful-paint'].numericValue
- audits['cumulative-layout-shift'].numericValue
- audits['interaction-to-next-paint'].numericValue (if present)
Subagent 3: Accessibility scan
# Agent: Accessibility Scan
## Role
Run pa11y against preview URLs and report WCAG 2.1 AA violations.
## Allowed Tools
- Bash
## Instructions
1. Read .claude/skills/run-pa11y.md
2. Run pa11y against the same page set as the CWV agent.
3. Group results by severity: error, warning, notice.
4. For each error, include:
- The WCAG criterion violated (e.g., 1.1.1 Non-text Content)
- The HTML element (selector)
- A one-sentence fix suggestion
5. Fail if any errors exist. Warn if warnings > 10.
## Output Format
{"passed": boolean, "error_count": number, "warning_count": number, "errors": [{"criterion": string, "selector": string, "fix": string}], "summary": string}
We use pa11y@8.0.0 with the --runner=axe flag. The default htmlcs runner misses some color contrast issues that axe catches.
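The grouping and pass/fail rule from steps 3 and 5 can be sketched in a few lines. This assumes the input is the array of issue objects from pa11y's JSON reporter, each with at least type, code, selector, and message. summarizeA11y is a hypothetical name. The sketch leaves the WCAG criterion and fix suggestion to the subagent.

```javascript
// `issues` is assumed to be the array pa11y emits with its JSON reporter,
// where each item carries at least `type`, `code`, `selector`, and `message`.
function summarizeA11y(issues, warningLimit = 10) {
  const groups = { error: [], warning: [], notice: [] };
  for (const issue of issues) {
    if (Object.hasOwn(groups, issue.type)) groups[issue.type].push(issue);
  }
  const errorCount = groups.error.length;
  const warningCount = groups.warning.length;
  return {
    passed: errorCount === 0, // any error fails the check
    warn: warningCount > warningLimit, // warn above the warning limit
    error_count: errorCount,
    warning_count: warningCount,
    errors: groups.error.map(({ code, selector, message }) => ({ rule: code, selector, message })),
    summary: `${errorCount} errors, ${warningCount} warnings, ${groups.notice.length} notices`,
  };
}
```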
Subagent 4: Broken link scan
# Agent: Broken Link Scan
## Role
Crawl the preview site and report broken internal and external links.
## Allowed Tools
- Bash
## Instructions
1. Read .claude/skills/run-linkinator.md
2. Run: npx linkinator@6.1.2 $PREVIEW_URL --recurse --timeout 15000 --format json > /tmp/link-results.json
3. Filter results to status >= 400 or status === 0 (timeout).
4. Separate internal (same domain) from external broken links.
5. Internal broken links are CRITICAL. External broken links are WARNING.
6. Exclude known-flaky external domains: twitter.com, linkedin.com (they block crawlers).
## Output Format
{"passed": boolean, "internal_broken": [{"source": string, "target": string, "status": number}], "external_broken": [...], "summary": string}
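Steps 3 through 6 are mechanical enough to sketch as code. This assumes linkinator's JSON output is an object with a links array whose items carry url, status, and parent. classifyLinks and the rule that only internal breaks fail the check are assumptions for illustration.

```javascript
// Domains excluded in step 6 because they block crawlers.
const FLAKY_HOSTS = ["twitter.com", "linkedin.com"];

function hostMatches(hostname, domain) {
  return hostname === domain || hostname.endsWith(`.${domain}`);
}

// `results` is assumed to look like { links: [{ url, status, parent }, ...] }.
function classifyLinks(results, siteUrl) {
  const siteHost = new URL(siteUrl).hostname;
  const report = { passed: true, internal_broken: [], external_broken: [], summary: "" };
  for (const link of results.links) {
    const broken = link.status >= 400 || link.status === 0; // 0 means timeout
    if (!broken) continue;
    let host;
    try {
      host = new URL(link.url).hostname;
    } catch {
      continue; // skip targets that are not parseable absolute URLs
    }
    if (FLAKY_HOSTS.some((domain) => hostMatches(host, domain))) continue;
    const entry = { source: link.parent, target: link.url, status: link.status };
    if (host === siteHost) report.internal_broken.push(entry);
    else report.external_broken.push(entry);
  }
  // Assumption: internal breaks (CRITICAL) fail the check, external ones only warn.
  report.passed = report.internal_broken.length === 0;
  report.summary = `${report.internal_broken.length} internal, ${report.external_broken.length} external broken links`;
  return report;
}
```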
Subagent 5: Schema validation
# Agent: Schema Validation
## Role
Validate JSON-LD structured data on all pages.
## Allowed Tools
- Bash
- Read file
## Instructions
1. Read .claude/skills/parse-schema-org.md
2. For each page in the sitemap:
a. Extract all <script type="application/ld+json"> blocks
b. Parse as JSON (fail if malformed)
c. Validate required fields per @type:
- Article: headline, datePublished, dateModified, author, image
- LocalBusiness: name, address, telephone
- WebPage: name, description
- BreadcrumbList: itemListElement with position, name, item
d. Check that all @id references resolve within the page's graph
e. Validate URLs in schema are absolute, not relative
3. Flag missing required fields as HIGH.
4. Flag malformed JSON as CRITICAL.
## Output Format
{"passed": boolean, "pages": [{"url": string, "schemas": [{"type": string, "valid": boolean, "issues": []}]}], "summary": string}
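For a sense of what steps 2a through 2c involve, here's a simplified sketch. It pulls JSON-LD blocks out of a page with a regular expression (a real HTML parser would be more robust) and checks the required fields listed above. validatePage and the issue shape are hypothetical.

```javascript
// Required fields per @type, copied from the agent definition above.
const REQUIRED = new Map([
  ["Article", ["headline", "datePublished", "dateModified", "author", "image"]],
  ["LocalBusiness", ["name", "address", "telephone"]],
  ["WebPage", ["name", "description"]],
  ["BreadcrumbList", ["itemListElement"]],
]);

function validatePage(html) {
  const issues = [];
  // Simplification: regex extraction instead of a real HTML parser.
  const re = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi;
  for (const match of html.matchAll(re)) {
    let doc;
    try {
      doc = JSON.parse(match[1]);
    } catch {
      issues.push({ severity: "CRITICAL", issue: "malformed JSON-LD block" });
      continue;
    }
    if (doc === null || typeof doc !== "object") continue;
    const nodes = Array.isArray(doc) ? doc : Array.isArray(doc["@graph"]) ? doc["@graph"] : [doc];
    for (const node of nodes) {
      if (!node || typeof node !== "object") continue;
      const types = [].concat(node["@type"] || []); // @type may be a string or an array
      for (const type of types) {
        for (const field of REQUIRED.get(type) || []) {
          const value = node[field];
          if (value === undefined || value === null || value === "") {
            issues.push({ severity: "HIGH", type, issue: `missing ${field}` });
          }
        }
      }
    }
  }
  return { valid: issues.length === 0, issues };
}
```

Note that a null value counts as missing here. The JSON still parses, but the field is unusable.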
Subagent 6: Deploy gate orchestrator
This parent agent spawns the other five and makes the go/no-go call.
# Agent: Deploy Gate
## Role
Orchestrate all pre-deploy checks and produce a final deploy decision.
## Allowed Tools
- Bash
- Read file
- Write file
- dispatch_agent
## Instructions
1. Spawn these agents in parallel:
- .claude/agents/seo-regression.md
- .claude/agents/cwv-smoke.md
- .claude/agents/accessibility.md
- .claude/agents/broken-links.md
- .claude/agents/schema-validation.md
2. Collect all outputs.
3. Read .claude/skills/format-deploy-report.md
4. Decision logic:
- If ANY agent has a CRITICAL flag: BLOCK deploy.
- If 2+ agents have HIGH flags: BLOCK deploy.
- If 1 agent has HIGH flags: WARN, require manual override.
- Otherwise: APPROVE.
5. Write the full report to /tmp/deploy-report.md
6. Output the decision.
## Output Format
{"decision": "APPROVE" | "WARN" | "BLOCK", "reports": {agent_name: agent_output}, "summary": string}
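The decision logic in step 4 is simple enough to write down as a pure function. This sketch assumes every agent report has been normalized to expose critical and high arrays, as the SEO agent's output already does. decide is a hypothetical name.

```javascript
// `reports` maps agent name -> parsed output. Each output is assumed to expose
// `critical` and `high` arrays; agents with other formats need normalizing first.
function decide(reports) {
  const flagged = (level) =>
    Object.keys(reports).filter((name) => {
      const flags = reports[name] && reports[name][level];
      return Array.isArray(flags) && flags.length > 0;
    });
  const critical = flagged("critical");
  const high = flagged("high");

  let decision = "APPROVE";
  if (critical.length > 0 || high.length >= 2) decision = "BLOCK";
  else if (high.length === 1) decision = "WARN"; // manual override required

  const summary =
    decision === "APPROVE"
      ? "No CRITICAL or HIGH flags."
      : `CRITICAL in [${critical.join(", ")}]; HIGH in [${high.join(", ")}]`;
  return { decision, reports, summary };
}
```

Keeping the go/no-go rule deterministic like this is a design choice. The subagents can reason about findings, but the final gate doesn't have to.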
Hook configuration: settings.json
Here's our actual settings.json (with client-specific URLs redacted):
{
"hooks": {
"pre-commit": [
{
"agent": ".claude/agents/schema-validation.md",
"condition": "files_changed_match('**/*.json', '**/structured-data/**')",
"env": {
"PREVIEW_URL": "http://localhost:3000"
}
}
],
"pre-push": [
{
"agent": ".claude/agents/deploy-gate.md",
"env": {
"PREVIEW_URL": "$VERCEL_PREVIEW_URL"
},
"timeout": 300,
"on_failure": "block"
}
],
"post-deploy-success": [
{
"command": "node scripts/extract-seo-meta.js --url=$PRODUCTION_URL --output=.claude/snapshots/seo-baseline.json",
"description": "Update SEO baseline after successful deploy"
}
]
},
"agent_defaults": {
"model": "claude-sonnet-4-20250514",
"max_tokens": 8192,
"timeout": 120
},
"skills_directory": ".claude/skills/"
}
Notes on this config:
- We use claude-sonnet-4-20250514 for subagents, not Opus. The reasoning tasks here don't justify the cost difference. Sonnet handles "compare two JSON objects and flag differences" fine.
- The timeout: 300 on the deploy gate gives all five subagents time to run. Individual agents have 120s defaults. The orchestrator gets 5 minutes because it waits on all of them.
- The condition on the pre-commit hook means schema validation only runs when you touch schema-related files. No point running it on a CSS change.
- post-deploy-success updates the baseline. Without this, your SEO regression check compares against stale data.
Skill definitions that glue it together
The skill file that does the most work is compare-seo-snapshot.md:
# Skill: Compare SEO Snapshots
## Purpose
Compare two SEO metadata snapshots and identify regressions.
## Input
- Current snapshot: /tmp/seo-current.json
- Baseline snapshot: .claude/snapshots/seo-baseline.json
## Comparison Rules
### Title Tags
- If a title changed AND the page's organic traffic (from baseline metadata) > 1000 sessions/month, flag as HIGH.
- If a title is now empty or matches another page's title, flag as CRITICAL.
- If a title changed on a low-traffic page, flag as MEDIUM.
### Canonical URLs
- Any change to canonical URL is HIGH.
- A canonical pointing to a different domain is CRITICAL.
- A missing canonical (was present, now gone) is HIGH.
### Robots Meta
- Any page that gained "noindex" is CRITICAL.
- Any page that gained "nofollow" on internal links is HIGH.
### New Pages
- Pages in current but not in baseline are INFO (expected for new content).
- But verify they have: title, meta description, canonical, at least one h1.
### Removed Pages
- Pages in baseline but not in current are HIGH.
- These might indicate accidental route removal.
This skill file encodes months of SEO incident response into a format Claude can reliably follow. Without it, the subagent would make reasonable but inconsistent judgments about what constitutes a regression.
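As a rough illustration of how the title and page-set rules translate into checks, here's a sketch. The snapshot shape, including a sessionsPerMonth field standing in for "organic traffic from baseline metadata", and the function name are assumptions. It also flags any duplicate title in the current build, which is slightly stricter than the rule as written.

```javascript
// Assumed snapshot shape: { [url]: { title: string, sessionsPerMonth?: number } }
function titleAndPageFlags(baseline, current) {
  const flags = [];
  const titleCounts = new Map();
  for (const page of Object.values(current)) {
    const t = (page.title || "").trim();
    if (t) titleCounts.set(t, (titleCounts.get(t) || 0) + 1);
  }
  for (const [url, before] of Object.entries(baseline)) {
    if (!Object.hasOwn(current, url)) {
      flags.push({ url, severity: "HIGH", issue: "page removed" });
      continue;
    }
    const title = (current[url].title || "").trim();
    if (!title) {
      flags.push({ url, severity: "CRITICAL", issue: "empty title" });
    } else if (titleCounts.get(title) > 1) {
      flags.push({ url, severity: "CRITICAL", issue: "duplicate title" });
    } else if (title !== (before.title || "").trim()) {
      const highTraffic = (before.sessionsPerMonth || 0) > 1000;
      flags.push({ url, severity: highTraffic ? "HIGH" : "MEDIUM", issue: "title changed" });
    }
  }
  for (const url of Object.keys(current)) {
    if (!Object.hasOwn(baseline, url)) {
      flags.push({ url, severity: "INFO", issue: "new page" });
    }
  }
  return flags;
}
```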
Four incidents the system caught
Incident 1: Accidental noindex on 47 blog posts
Client: B2B SaaS company, 200 pages, 60k organic sessions/month.
A developer updated the <Head> component in the blog template to add a new meta tag. They copy-pasted from the staging config, which had <meta name="robots" content="noindex, nofollow"> hardcoded. The change passed code review because the reviewer focused on the new tag, not the existing ones.
The SEO regression subagent flagged 47 pages as CRITICAL -- robots meta changed to noindex. The deploy was blocked.
Time to detect: 2 minutes 14 seconds after push. Without the system, it would've been caught when Search Console showed a coverage drop 3-7 days later.
Estimated impact avoided: Those 47 posts drove roughly $14,000/month in pipeline. Even a one-week deindex event could've cost $3,500+.
Incident 2: CLS regression from a new hero image
Client: E-commerce brand, Next.js 14 storefront on Shopify Hydrogen.
The design team swapped the homepage hero to a new image with different aspect ratio but didn't update the width/height attributes on the <Image> component. The image loaded fine but caused a CLS of 0.34 -- well above the 0.1 threshold.
The CWV smoke test subagent reported CLS regression on the homepage. The summary specifically called out: "CLS caused by element img.hero-banner shifting 0.34 cumulative. The image dimensions (1920x800) don't match the container aspect ratio (16:9 = 1920x1080). Add explicit width={1920} height={800} or update the container."
Time to detect: 1 minute 47 seconds.
Incident 3: Broken internal links after URL restructure
Client: Professional services firm, 80 pages.
We restructured their service pages from /services/[name] to /[category]/[name]. Redirects were in place, but three blog posts had hardcoded links to the old URLs, and the CMS-driven navigation had a cached entry pointing to a deleted page.
The broken link scan found 4 internal 404s. The subagent's summary noted that 3 of the 4 were in blog post body content (not navigation), which meant they'd been missed by the redirect audit.
Time to detect: 3 minutes 8 seconds. The linkinator crawl is the slowest part.
Incident 4: Missing dateModified in Article schema
Client: Media company, 2,000 articles.
A CMS migration from WordPress to Sanity lost the dateModified field mapping. The schema generation code fell back to null for dateModified, so the JSON-LD still parsed but every Article was left without a usable dateModified value.
The schema validation subagent flagged every article page as HIGH -- missing required dateModified field. The summary explained: "Google requires dateModified for Article structured data to be eligible for Top Stories and rich results. All 2,147 article pages are affected."
Time to detect: 4 minutes 22 seconds (large sitemap).
ROI: minutes saved per ship and dollars per month
Here's our math:
| Metric | Before (CI + manual) | After (subagents) | Delta |
|---|---|---|---|
| Checks per deploy | 4 tools, manual review | 5 agents, automated | +1 check, -100% manual review |
| Time to run all checks | 8-12 min (sequential CI) | 3-5 min (parallel subagents) | -60% |
| Time to understand failures | 20-40 min per failure | 1-2 min (contextual summary) | -95% |
| Deploys per week (all clients) | 18 | 18 | Same |
| False positive rate | ~15% (noisy Lighthouse) | ~4% (reasoning filters noise) | -73% |
Minutes saved per ship: Average 25 minutes when a check fails (30% of deploys, so 18 × 0.3 = 5.4 failing deploys/week). That's 25 × 5.4 = 135 minutes/week, or roughly 9 hours/month.
Cost of the system:
- Claude API costs for subagents: ~$0.12 per full deploy gate run (5 agents, Sonnet, 6,000 tokens average per agent)
- 18 deploys/week × 4.3 weeks × $0.12 = $9.29/month in API costs
- Puppeteer/Lighthouse infrastructure: runs on existing Vercel build instances, no added cost
- Maintenance time: ~2 hours/month updating skill files and thresholds
Dollar value of engineer time saved: 9 hours/month × $85/hour (blended rate for our team) = $765/month saved.
Dollar value of incidents prevented: Based on the four incidents above, the noindex incident alone could've cost $3,500. If we prevent one incident like that per quarter, that's $1,166/month in avoided client impact.
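Net ROI: $765 + $1,166 = $1,931/month in value for $9.29/month in API costs. That's roughly a 208x return on API spend. Count the 2 hours/month of maintenance at the same $85/hour ($170) and total cost is about $179/month, so the all-in return is closer to 11x. Even if you 10x the API costs for a larger team, it's still favorable.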
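The arithmetic is easy to recheck in a few lines. This sketch uses only the figures stated in this section; the function and field names are made up for illustration.

```javascript
// All inputs are figures stated in this section; names are illustrative.
function roi(inputs) {
  const timeValue = inputs.hoursSavedPerMonth * inputs.hourlyRate;
  const incidentValue = inputs.incidentCost / inputs.monthsBetweenIncidents;
  const value = timeValue + incidentValue;
  const apiCost = inputs.deploysPerWeek * inputs.weeksPerMonth * inputs.costPerRun;
  const maintenanceCost = inputs.maintenanceHoursPerMonth * inputs.hourlyRate;
  return {
    value,
    apiCost,
    maintenanceCost,
    returnOnApiSpend: value / apiCost,
    returnAllIn: value / (apiCost + maintenanceCost),
  };
}

const result = roi({
  hoursSavedPerMonth: 9,
  hourlyRate: 85,
  incidentCost: 3500,
  monthsBetweenIncidents: 3,
  deploysPerWeek: 18,
  weeksPerMonth: 4.3,
  costPerRun: 0.12,
  maintenanceHoursPerMonth: 2,
});

for (const [key, amount] of Object.entries(result)) {
  console.log(`${key}: ${amount.toFixed(2)}`);
}
```

With these inputs the value comes out near $1,932/month, which is about 208x the API spend and about 11x once maintenance time is counted as a cost.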
Gaps and what we'd change
This system isn't perfect. Here's what's still rough:
No visual regression testing. Subagents can run Lighthouse and pa11y but can't look at screenshots and say "the hero section is broken." We're watching Claude's vision capabilities for this.
Baseline drift. The SEO baseline updates on successful deploy, but if you ship a regression that the system doesn't catch, it becomes the new baseline. We manually review baselines monthly.
External link flakiness. Twitter/X, LinkedIn, and some government sites block crawlers or rate-limit aggressively. We maintain an exclusion list, but it needs manual updates.
Cold start time. The first run after cloning a repo takes longer because npx needs to fetch packages. We're considering pre-installing the CLI tools in a Docker layer.
Anthropic rate limits. Spawning 5 subagents simultaneously can occasionally hit rate limits on the Claude API during peak hours. We added a 2-second stagger between spawns, which works but is inelegant.
Our longer agent definitions (schema validation is 400 words) occasionally produce less structured output than the shorter ones. We're considering splitting the schema validation agent into per-type sub-subagents.
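The 2-second stagger mentioned above can be sketched without giving up parallelism: each agent starts a little after the previous one, but nothing waits for an earlier agent to finish. spawnStaggered is a hypothetical helper. How each task actually launches a subagent is left out.

```javascript
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// `tasks` is an array of functions; each one starts a subagent and returns a
// promise for its output. How a task spawns the agent is out of scope here.
async function spawnStaggered(tasks, staggerMs = 2000) {
  const running = [];
  for (let i = 0; i < tasks.length; i += 1) {
    if (i > 0) await sleep(staggerMs);
    // Start the task without awaiting it so the agents still overlap.
    // Handlers are attached right away so an early failure is captured
    // instead of surfacing as an unhandled rejection during the next sleep.
    running.push(
      Promise.resolve()
        .then(() => tasks[i]())
        .then(
          (value) => ({ status: "fulfilled", value }),
          (reason) => ({ status: "rejected", reason })
        )
    );
  }
  return Promise.all(running);
}
```

With five agents and the default delay, the last one starts 8 seconds after the first, which fits easily inside the 300-second gate timeout.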
FAQ
Do Claude Code subagents work with any LLM, or only Claude?
Subagents are a Claude Code feature tied to Anthropic's API. You need a Claude API key with access to Claude Code. The agent definition format is specific to Claude Code's .claude/ directory convention, not a general standard.
How much does running five subagents per deploy cost in API fees?
At our scale, roughly $0.12 per full deploy gate run using Claude Sonnet. That's about $9-10/month for 18 deploys per week. Opus would cost approximately 5x more but we haven't found it necessary for these tasks.
Can subagents run in CI/CD pipelines like GitHub Actions?
Yes. You can invoke Claude Code headlessly in a CI environment. We trigger ours on Vercel preview deploy completion via a webhook that calls claude-code run .claude/agents/deploy-gate.md with the preview URL as an environment variable.
What's the difference between a Claude Code skill and a subagent?
A skill is a markdown instruction file that teaches Claude how to do something -- like a recipe. A subagent is an isolated Claude instance that can be spawned with its own context and tools. Subagents use skills. Think of skills as documentation and agents as workers.
Do you need Anthropic's Routines feature or are raw subagents enough?
For our deploy gate workflow, raw subagents plus hooks in settings.json are sufficient. Routines add a higher-level orchestration layer that's useful for more complex multi-step workflows. We may adopt Routines if our deploy checks grow beyond six agents.
How do you handle subagent failures or timeouts?
Each subagent has a 120-second timeout. If a subagent fails or times out, the deploy gate orchestrator treats it as a WARN, not a BLOCK. We'd rather ship with an incomplete check than block deploys because Lighthouse hung. The summary notes which checks didn't complete.
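One way to get that behavior is to race each subagent against a timer and turn both timeouts and failures into a WARN-level result instead of an exception. This is a sketch with hypothetical names. It stops waiting for the subagent but doesn't cancel it.

```javascript
// Wrap a subagent promise so a timeout or failure becomes a WARN-level result
// rather than an exception. The underlying work is not cancelled.
function withTimeout(promise, ms, name) {
  let timer;
  const timeout = new Promise((resolve) => {
    timer = setTimeout(
      () => resolve({ name, completed: false, level: "WARN", reason: `timed out after ${ms} ms` }),
      ms
    );
  });
  const settled = promise.then(
    (output) => ({ name, completed: true, output }),
    (err) => ({
      name,
      completed: false,
      level: "WARN",
      reason: String(err && err.message ? err.message : err),
    })
  );
  return Promise.race([settled, timeout]).finally(() => clearTimeout(timer));
}
```

A call like withTimeout(runAgent("cwv-smoke"), 120000, "cwv-smoke"), where runAgent is a stand-in for however the agent is launched, always resolves. The orchestrator can then list incomplete checks in its summary.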
Can this approach replace dedicated tools like Lighthouse CI or pa11y?
No -- it wraps them. The subagents call these tools via bash and then reason about the output. You still need the underlying tools installed. The value is in the orchestration, correlation, and natural-language reporting layer, not in replacing the scanners themselves.