Most agencies outsource their content or hire a junior writer to crank out SEO posts that read like they were generated by a toaster. We tried that. It didn't work. So we built something different -- a multi-model AI pipeline that drafts, humanizes, scores, and ships articles at a pace no single writer could match, while maintaining a quality bar that actually reflects how we think about web development.

This is the story of how we shipped 91 articles in under three months, the specific tools and models we wired together, and every ugly lesson we learned along the way.

Why We Built Our Own Blog Pipeline with Claude, GPT-4o & Winston AI

The Problem With Agency Content

Here's a truth nobody in the agency world wants to say out loud: most development shops are terrible at content marketing. We're no exception -- or at least, we weren't.

We had the classic problem. Our team knows how to build things with Next.js, Astro, and various headless CMS platforms. We ship real products for real clients. But writing about it? Consistently? At a cadence that actually moves the SEO needle? That's a different muscle entirely.

We tried hiring freelance writers. The technical depth was shallow. We tried having developers write posts. They'd produce one brilliant article and then disappear into a sprint for six weeks. We tried basic AI generation with ChatGPT -- the output read like a Wikipedia article had a baby with a marketing brochure.

So we asked ourselves: what if we treated content production like a software engineering problem? What if we built a pipeline?

Architecture of Our Blog Pipeline

The pipeline has five stages. Each stage has a specific model or tool responsible for it, and each produces a measurable output that feeds the next stage.

┌──────────────┐     ┌───────────────┐     ┌─────────────┐
│  Research &  │────▶│  Claude Opus  │────▶│  GPT-4o     │
│  Brief Gen   │     │  First Draft  │     │  Humanizer  │
└──────────────┘     └───────────────┘     └─────────────┘
                                                  │
                                                  ▼
                                          ┌──────────────┐
                                          │  Winston AI  │
                                          │  Detection   │
                                          └──────────────┘
                                                  │
                                                  ▼
                                          ┌──────────────┐
                                          │  Human Edit  │
                                          │  & Publish   │
                                          └──────────────┘

Stage 1: Research & Brief Generation

We use a combination of Ahrefs for keyword research and Tavily's API for real-time competitive analysis. The brief is a structured JSON document that includes:

  • Target keyword and secondary keywords
  • Top 10 competing articles (titles, word counts, H2 structures)
  • People Also Ask questions scraped from Google
  • A proposed outline with target word count per section
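As a concrete sketch -- field names and values here are illustrative, not our exact schema -- the brief looks something like this:

```python
# Illustrative brief shape. Field names and values are examples,
# not our production schema.
brief = {
    "target_keyword": "astro vs nextjs performance",
    "secondary_keywords": ["astro islands", "partial hydration"],
    "competitors": [
        {
            "title": "Astro vs Next.js: Which Is Faster?",
            "word_count": 2400,
            "h2s": ["Build Output", "Hydration Model", "Benchmarks"],
        },
        # ...one entry per top-10 result
    ],
    "people_also_ask": ["Is Astro faster than Next.js?"],
    "outline": [
        {"heading": "Why Rendering Strategy Matters", "target_words": 400},
        # ...one entry per planned section
    ],
}
```

Because it's structured data rather than prose, the same brief can drive both the Claude prompt and downstream checks like word-count validation.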

This brief becomes the input prompt for Claude.

Stage 2: Claude Opus First Draft

Claude Opus 4 writes the first draft. More on why below.

Stage 3: GPT-4o Humanizer Pass

The draft goes through GPT-4o with a carefully tuned system prompt designed to make the writing sound like a real person wrote it.

Stage 4: Winston AI Detection

We score every article through Winston AI. If it doesn't hit our threshold, it goes back through the humanizer with different parameters.

Stage 5: Human Edit & Publish

A real person reads every article. They check technical accuracy, add personal anecdotes where appropriate, and handle final formatting.

Why Claude Opus 4 for First Drafts

We tested every major model for first-draft generation. Here's what we found:

| Model | Technical Depth (1-10) | Structure Quality (1-10) | Avg. Word Count | AI Detection Score (Winston) | Cost per Article |
|---|---|---|---|---|---|
| GPT-4o | 7 | 8 | 2,400 | 32% human | $0.18 |
| Claude Opus 4 | 9 | 9 | 3,100 | 28% human | $0.42 |
| Claude Sonnet 4 | 8 | 8 | 2,600 | 35% human | $0.08 |
| Gemini 2.5 Pro | 7 | 7 | 2,800 | 30% human | $0.14 |
| Llama 3.1 405B | 6 | 6 | 2,200 | 41% human | $0.03 |

Claude Opus 4 won on the two dimensions we cared about most: technical depth and structural quality. The AI detection scores were actually worse than GPT-4o's raw output, but that didn't matter because we weren't going to publish raw output from any model.

The thing about Claude Opus that's hard to quantify in a table is this: it follows complex instructions more faithfully than anything else we tested. When we say "write like a senior developer sharing hard-won knowledge," Claude actually shifts its register. GPT-4o tends to fall back into a helpful-assistant voice no matter how hard you push it. Gemini produces decent technical content but gets weirdly formal in places.

The cost difference is real -- Opus is roughly 2-5x more expensive per token than the alternatives. But when you factor in the time saved on rewrites, it's the cheapest option overall.

The System Prompt That Made the Difference

We iterated on our Claude system prompt for about three weeks before landing on something that consistently produced good output. A few things we learned:

  1. Banning specific phrases works better than asking for a tone. Instead of saying "write in a casual tone," we maintain a list of banned words and phrases. Things like "comprehensive," "leverage," "in today's digital landscape" -- the dead giveaways of AI-generated content.

  2. Forcing structural constraints produces better content. We specify exact heading structures, require code blocks, demand markdown tables. Claude Opus follows these constraints almost perfectly.

  3. Providing real context beats generic instructions. We feed in actual competitive research. We tell Claude what the top-ranking articles cover and where they fall short. This produces content that's genuinely differentiated.
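The banned list is easy to enforce on the output side too. A minimal linter along these lines (the phrase list is a small sample, not our full banned list):

```python
# Small sample of the banned list -- the real list is longer and
# versioned alongside the system prompt.
BANNED_PHRASES = [
    "comprehensive",
    "leverage",
    "in today's digital landscape",
    "delve into",
]

def find_banned_phrases(text: str) -> list[str]:
    """Return every banned phrase that appears in the text (case-insensitive)."""
    lowered = text.lower()
    return [p for p in BANNED_PHRASES if p in lowered]

draft = "In today's digital landscape, you can leverage Astro islands."
print(find_banned_phrases(draft))  # ['leverage', "in today's digital landscape"]
```

Running this on every draft turns "please write casually" from a hope into a testable assertion.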

import anthropic

# Client reads ANTHROPIC_API_KEY from the environment.
anthropic_client = anthropic.Anthropic()

def generate_first_draft(brief: dict) -> str:
    # load_prompt and format_brief are our own helpers; prompt files are versioned in the repo.
    system_prompt = load_prompt("claude_writer_v14.txt")
    
    messages = [
        {"role": "user", "content": format_brief(brief)}
    ]
    
    response = anthropic_client.messages.create(
        model="claude-opus-4-20250514",
        max_tokens=8192,
        system=system_prompt,
        messages=messages,
        temperature=0.7,  # slightly creative, not chaotic
    )
    
    return response.content[0].text

We settled on a temperature of 0.7. Lower than that and the writing feels robotic. Higher and Claude starts making things up -- hallucinating framework features, inventing API endpoints that don't exist.

The GPT-4o Humanizer Pass

This is where things get interesting. And a little weird.

After Claude produces a technically solid first draft, we pass it through GPT-4o with a completely different system prompt. This prompt's job isn't to add information -- it's to make the writing feel more human.

What does that actually mean in practice? A few specific transformations:

  • Sentence length variation. AI models tend to write sentences that are all roughly the same length. Humans don't do that. We instruct GPT-4o to mix short punchy sentences with longer ones.
  • Imperfect transitions. Real blog posts don't have perfect paragraph-to-paragraph flow. Sometimes you just jump to the next thought. The humanizer adds these natural breaks.
  • First-person insertions. "In our experience," "We've found that," "I spent a week debugging this" -- these small touches make a huge difference in AI detection scores.
  • Contractions. Claude Opus tends to write "do not" and "it is" even when instructed otherwise. The humanizer pass catches these and converts them.

from openai import OpenAI

# Client reads OPENAI_API_KEY from the environment.
openai_client = OpenAI()

def humanize_draft(draft: str) -> str:
    system_prompt = load_prompt("gpt4o_humanizer_v8.txt")
    
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Humanize this article while preserving all technical accuracy and structure:\n\n{draft}"}
        ],
        temperature=0.8,
    )
    
    return response.choices[0].message.content

Why GPT-4o for this pass instead of Claude? Honestly, it's because GPT-4o is better at sounding casual. Claude's strength is technical precision and instruction-following. GPT-4o's strength is mimicking human writing patterns. We're playing to each model's strengths.

The Double-Model Approach Wasn't Our First Idea

We initially tried doing everything with a single model. One prompt, one pass, one output. The results were mediocre across the board. The draft was either technically strong but robotic, or conversational but shallow.

Splitting the pipeline into specialized stages was the breakthrough. It's the same principle behind microservices -- each component does one thing well.

Winston AI Detection and the 85% Threshold

We chose Winston AI as our detection tool after testing five different AI content detectors. Here's why:

| Detector | Consistency (same input, same score?) | False Positive Rate | API Available? | Price/month |
|---|---|---|---|---|
| Winston AI | High | Low (~3%) | Yes | $18/mo |
| Originality.ai | High | Medium (~8%) | Yes | $15/mo |
| GPTZero | Medium | Medium (~7%) | Yes | $10/mo |
| Copyleaks | Medium | Low (~4%) | Yes | $8/mo |
| Sapling | Low | High (~12%) | Yes | Free tier |

Winston AI gave us the most consistent scores across runs. If you feed it the same article twice, you get nearly the same human score. That matters when you're building an automated pipeline -- you need deterministic-ish behavior to make decisions.

Our threshold is 85% human score. Below that, the article goes back through the humanizer with adjusted parameters (higher temperature, different instruction emphasis). If it fails a second time, a human rewrites the flagged sections manually.

In practice, about 70% of articles pass on the first humanizer run. Another 20% pass on the second. The remaining 10% need manual intervention.
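The gate-and-retry logic is a short loop. Here's a simplified sketch, with the humanizer and scorer passed in as plain callables standing in for the GPT-4o and Winston calls described above (the real retries also adjust temperature and instruction emphasis, which is omitted here):

```python
def humanize_until_passing(draft, humanize, score, threshold=85, max_attempts=2):
    """Run the humanizer, check the detection score, retry once, then hand
    off to a human. `humanize` and `score` stand in for the GPT-4o pass and
    the Winston AI scan; real retries also tweak the humanizer parameters."""
    article = draft
    for attempt in range(max_attempts):
        article = humanize(article)
        if score(article) >= threshold:
            return article, "passed"
    return article, "needs_manual_rewrite"
```

Capping retries matters: without a limit, a stubborn article can burn tokens indefinitely without ever clearing the threshold.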

def check_detection(article: str) -> dict:
    # winston_client is our thin wrapper around Winston AI's detection API.
    result = winston_client.scan(text=article)
    
    return {
        "human_score": result.score,  # 0-100
        "passed": result.score >= 85,
        "flagged_sentences": result.flagged_sentences
    }

The flagged_sentences field is gold. Instead of re-running the entire article, we can target just the sentences that triggered the detector. This saves tokens and produces better results.
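The targeted pass is mostly string surgery: rewrite only the flagged sentences and splice them back in place. A simplified sketch, with the per-sentence rewrite call passed in as a stand-in for a GPT-4o request scoped to one sentence:

```python
def rewrite_flagged(article: str, flagged: list[str], rewrite) -> str:
    """Replace only the sentences the detector flagged, leaving the rest of
    the article untouched. `rewrite` stands in for a per-sentence GPT-4o call."""
    for sentence in flagged:
        if sentence in article:
            # Replace only the first occurrence, so repeated phrasing
            # elsewhere in the article is left alone.
            article = article.replace(sentence, rewrite(sentence), 1)
    return article
```

Rewriting a handful of sentences instead of the whole article keeps the already-passing prose stable, so the next Winston scan doesn't flag new sections at random.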

The Full Workflow Step by Step

Here's what actually happens when we want to publish a new article:

  1. Keyword selection -- We pull from our content calendar (maintained in Notion) and cross-reference with Ahrefs keyword difficulty scores. We target KD < 30 for new topics.

  2. Competitive research -- Our script hits Tavily's search API and pulls the top 10 results. It extracts headings, word counts, and content gaps.

  3. Brief generation -- A Claude Sonnet 4 call (cheaper than Opus for this task) generates a structured brief from the research data.

  4. First draft -- Claude Opus 4 produces the article. Takes about 45-90 seconds depending on length.

  5. Humanizer pass -- GPT-4o rewrites for voice and naturalness. Another 30-60 seconds.

  6. Detection scoring -- Winston AI scores the output. Results come back in about 10 seconds.

  7. Loop or proceed -- If score < 85%, go back to step 5 with modified parameters. Max 2 retries.

  8. Human review -- A team member reads the article, checks facts, adds screenshots or diagrams, and formats for our CMS.

  9. Publish -- Article goes live through our headless CMS pipeline.

Total time per article: about 35 minutes of human attention. The AI stages take about 3 minutes of compute time.
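The content-gap extraction in step 2 is worth a closer look. A simplified version of the idea -- headings passed in directly here, where the real script pulls them from the Tavily results:

```python
from collections import Counter

def find_content_gaps(our_outline, competitor_headings):
    """Return H2 topics that appear in at least two competing articles but
    are missing from our proposed outline -- candidates to add to the brief.
    `competitor_headings` is a list of heading lists, one per competitor."""
    ours = {h.lower() for h in our_outline}
    counts = Counter(h.lower() for headings in competitor_headings for h in headings)
    return sorted(h for h, n in counts.items() if n >= 2 and h not in ours)
```

The "appears in at least two competitors" rule filters out one-off tangents while still surfacing topics the rest of the SERP agrees are table stakes.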

What 91 Articles Taught Us About AI Content

We've been running this pipeline since January 2025. Here are the patterns that emerged:

Technical Content Performs Better

Our best-performing articles are deeply technical pieces about specific frameworks and tools. Articles about Next.js development patterns or Astro performance optimization consistently outperform generic "what is headless CMS" content.

This makes sense. AI-generated generic content is everywhere now. Google's ranking algorithms are clearly favoring specificity and depth. Our pipeline is designed to produce exactly that kind of content.

The First 30 Articles Were Rough

I'm not going to pretend we nailed it from day one. The first batch of articles had issues:

  • Inconsistent voice across articles
  • Some hallucinated statistics (Claude confidently cited a "2024 Gartner report" that didn't exist)
  • Code examples that didn't compile
  • Repetitive section structures

We fixed these through prompt iteration and stricter human review. The system prompt is now on version 14. Each version addressed specific failure modes we identified in published content.

AI Detection Is a Moving Target

Winston AI updated their detection model twice during our three-month run. Each time, our scores dropped by 5-10 points and we had to adjust the humanizer prompt. This is an ongoing arms race, and if you're building something similar, plan for maintenance.

Human Review Is Non-Negotiable

We tried skipping human review for a batch of 5 articles as an experiment. Two of them had factual errors that would have embarrassed us. One referenced an API that was deprecated in 2023. Another claimed Next.js 15 supported a feature that's actually still in RFC.

Every article gets human eyes. Period.

Cost Breakdown and Performance Data

Here are the real numbers from our 91-article run:

| Metric | Value |
|---|---|
| Total articles published | 91 |
| Average word count | 2,847 |
| Total AI API costs | $127.40 |
| Average cost per article (AI only) | $1.40 |
| Winston AI subscription (3 months) | $54.00 |
| Ahrefs subscription (3 months) | $297.00 |
| Tavily API costs | $42.00 |
| Human review time (avg per article) | 35 min |
| Total human hours | ~53 hours |
| Articles passing Winston on first try | 64 (70%) |
| Articles needing manual rewrite | 9 (10%) |
| Average Winston AI human score (final) | 89% |
| Organic traffic increase (Jan-Mar 2025) | +340% |
| Indexed pages increase | +86 |

The $1.40 per article in AI costs is remarkably low. The real expense is human time -- 53 hours across three months for review and editing. But compare that to what a freelance technical writer charges. At $0.15/word for quality technical content, a 2,847-word article would cost about $427. We're producing comparable-quality content for roughly $23 in human time (35 minutes at a $40/hour rate) plus $1.40 in AI costs.

That's about a 94% cost reduction. And the output is at least as technically accurate, because the pipeline draws on broader technical knowledge than any single freelance writer has.
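For anyone checking the back-of-the-envelope math, using the figures from the table above:

```python
# Inputs from the metrics table above.
words = 2847
freelance_rate = 0.15    # $/word for quality technical content
review_minutes = 35
hourly_rate = 40.0
ai_cost = 1.40           # average AI API cost per article

freelance_cost = words * freelance_rate
pipeline_cost = review_minutes / 60 * hourly_rate + ai_cost
savings = 1 - pipeline_cost / freelance_cost

print(f"${freelance_cost:.2f} vs ${pipeline_cost:.2f} ({savings:.0%} cheaper)")
# $427.05 vs $24.73 (94% cheaper)
```

This ignores the fixed subscriptions (Ahrefs, Winston, Tavily), which add a few dollars per article at our volume but don't change the conclusion.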

Tools We Evaluated and Rejected

Not everything we tried made it into the final pipeline:

  • Jasper AI -- Too focused on marketing copy. Couldn't produce the technical depth we needed. Also expensive at $59/month for their business tier.
  • Copy.ai -- Similar issues to Jasper. Great for ad copy, not for 3,000-word technical articles.
  • Undetectable.ai -- We tried this as a humanizer instead of GPT-4o. The output was grammatically awkward and sometimes changed the technical meaning of sentences. Hard pass.
  • Surfer SEO -- Good tool, but we preferred building our own SEO analysis with Ahrefs data. Surfer's content editor felt too constraining.
  • Perplexity API -- We tested this for the research stage. Results were good but the citation format didn't integrate well with our brief structure. Might revisit.

FAQ

Isn't this just content spam? No. Every article goes through human review for technical accuracy and genuine usefulness. We're not spinning content or publishing thin pages. Each piece targets a specific keyword with real depth. The AI handles the heavy lifting of first-draft generation, but the editorial judgment is entirely human. Check our content across the site -- we hold ourselves to the same standard we'd want from a technical blog we read.

Why not just hire writers? We still use human writers for certain pieces -- case studies, opinion pieces, and anything that requires direct client experience. But for technical explainers and comparison articles, our pipeline produces better first drafts than most freelance writers because the AI models have broader and more current technical knowledge. The economics also make it possible to publish at a volume that would be prohibitively expensive with freelancers alone.

Does Google penalize AI-generated content? Google's official stance since their March 2024 update is that they evaluate content quality regardless of how it's produced. They penalize low-quality, mass-produced content -- whether it's AI-generated or written by a content farm in a language the writer doesn't speak natively. Our content ranks because it's genuinely useful, technically accurate, and well-structured. We've seen consistent indexing and ranking improvements across our 91 articles.

What does the Winston AI human score mean, exactly? Winston AI analyzes text patterns -- perplexity, burstiness, sentence structure variation, vocabulary distribution -- and produces a score from 0 to 100 representing the likelihood the text was written by a human. A score of 85 means Winston estimates an 85% chance that a human wrote it. No detector is perfect, but Winston's consistency makes it useful as a quality gate in an automated pipeline.

Could you open-source this pipeline? We've considered it. The core logic isn't that complex -- it's mostly API calls stitched together with Python. The real value is in the prompts, and those are tuned specifically to our voice and technical domain. We might release a generic version at some point. If you're interested, reach out to us.

How do you handle code examples in articles? This is one area where human review is critical. Claude Opus generates syntactically correct code about 90% of the time, but the remaining 10% includes subtle bugs, deprecated APIs, or patterns that would make an experienced developer wince. Every code block gets manually verified. For framework-specific code, we often run it locally to confirm it works.

What happens when the AI models get updated? Model updates can break everything. When Anthropic released Claude Opus 4, our prompts that worked perfectly on Claude 3 Opus needed significant rework. We maintain versioned prompts and test against a benchmark set of 10 articles whenever a model updates. Budget time for this -- it's happened three times in our three-month run.

What's next for the pipeline? We're working on adding automated screenshot generation using Playwright, integrating with our headless CMS deployment pipeline for one-click publishing, and building a feedback loop where Google Search Console data influences which topics we prioritize next. The goal is to reduce that 35-minute human review time without sacrificing quality. We'll probably write about it when it's done. Check our pricing page if you're curious about how we apply similar systematic thinking to client projects.