Last quarter, we took on a project that sounded simple on paper: enrich 28,840 product records with AI-generated descriptions, categories, and metadata. The client had a massive e-commerce catalog migrating to a headless CMS, and every single record needed better content. What followed was a masterclass in everything that can go wrong -- and right -- when you throw tens of thousands of records at an AI API.

This isn't a theoretical guide. I'm going to walk you through the actual architecture we built, the exact costs we paid, the failure modes we hit, and the patterns that saved us. If you're considering AI bulk content enrichment for your own project, this should save you a few weeks of painful discovery.

Why We Chose AI Enrichment Over Manual Work

The math was brutally simple. Our client had 28,840 product records -- each one needed a rewritten description (150-300 words), three SEO-friendly category tags, a meta description, and structured attributes extracted from unstructured text. At a conservative estimate of 8 minutes per record for a human copywriter, that's 3,845 hours of work. At $35/hour, you're looking at $134,575 and roughly 6 months of elapsed time with a small team.

We completed the AI enrichment in 11 days for under $3,200 in API costs, plus about 80 hours of engineering and 40 hours of QA time. Even factoring in those hours, the total cost was roughly 12% of the manual approach.

But here's the thing nobody tells you: the hard part isn't calling the API. It's everything around it. Data cleaning, prompt tuning, quality validation, error handling, and the inevitable edge cases that make you question your career choices.

The Architecture: How We Built the Pipeline

We built the enrichment pipeline as a Node.js application, which made sense given our team's expertise in Next.js development and TypeScript. Here's the high-level architecture:

Source CSV → Parser → Batch Queue → Claude API → Response Validator → Output Store → QA Dashboard

The Data Layer

We used SQLite as our local processing database. Sounds unsexy, right? But for batch processing like this, it's perfect. No server to manage, transactions are fast, and you can query your results easily. Each record got a status column tracking its journey:

interface EnrichmentRecord {
  id: string;
  original_title: string;
  original_description: string;
  raw_attributes: string;
  status: 'pending' | 'processing' | 'completed' | 'failed' | 'needs_review';
  enriched_description: string | null;
  enriched_categories: string[] | null;
  enriched_meta: string | null;
  structured_attributes: Record<string, string> | null;
  attempts: number;
  last_error: string | null;
  token_usage: number;
  created_at: string;
  updated_at: string;
}

The Queue System

We implemented a simple job queue using BullMQ backed by Redis. Each job represented a single record enrichment. We configured it with:

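  • Concurrency: 5 parallel workers (more on why this number later)
  • Max attempts: 3 per record (the first try plus two retries)
  • Backoff: Exponential, starting at 30 seconds
  • Job timeout: 60 seconds

import { Queue } from 'bullmq';

// redisConnection holds the Redis host/port settings and is defined elsewhere.
const enrichmentQueue = new Queue('enrichment', {
  connection: redisConnection,
  defaultJobOptions: {
    attempts: 3, // total tries per record, including the first
    backoff: {
      type: 'exponential',
      delay: 30000, // 30 s before the first retry
    },
    removeOnComplete: false, // Keep for auditing
  },
});
// Concurrency is a Worker option ({ concurrency: 5 }), not a queue option. The
// 60-second limit has to be enforced inside the job processor: unlike the older
// Bull library, BullMQ does not enforce a per-job `timeout` option.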
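
To make the timing of those settings concrete, here is a minimal sketch of the retry schedule they imply. It assumes a plain doubling rule and that `attempts` counts the first try; the helper name `retryDelaysMs` is ours for illustration, and BullMQ's internal scheduling may differ in detail.

```typescript
// Hypothetical helper: lists the wait before each retry under a doubling backoff.
// Assumption: the delay doubles on every retry, starting from baseDelayMs.
function retryDelaysMs(attempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  // `attempts` includes the first try, so there are attempts - 1 retries.
  for (let retry = 1; retry < attempts; retry++) {
    delays.push(baseDelayMs * 2 ** (retry - 1));
  }
  return delays;
}

const schedule = retryDelaysMs(3, 30000); // [30000, 60000]
```

Under these assumptions a record that keeps failing gives up about 90 seconds after its first attempt, plus processing time.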

The Processing Worker

Each worker pulled a record, constructed the prompt, called Claude's API, validated the response structure, and wrote the results back. If the response didn't match our expected JSON schema, it went into a needs_review bucket rather than silently corrupting our dataset.
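
The worker flow described above can be sketched roughly as follows. This is not our production code: `processRecord`, `isEnriched`, and the field names are illustrative, the model call is injected as a function so the sketch does not depend on any SDK, and sending unparseable JSON to needs_review (rather than failing it) is an assumption.

```typescript
type Status = 'completed' | 'failed' | 'needs_review';

interface Enriched {
  description: string;
  categories: string[];
  meta: string;
  attributes: Record<string, string>;
}

// Runtime shape check: true only if the parsed value has every expected field.
function isEnriched(value: unknown): value is Enriched {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.description === 'string' &&
    Array.isArray(v.categories) &&
    v.categories.every((c) => typeof c === 'string') &&
    typeof v.meta === 'string' &&
    typeof v.attributes === 'object' &&
    v.attributes !== null
  );
}

// `callModel` is injected so the sketch stays independent of any API client.
async function processRecord(
  prompt: string,
  callModel: (prompt: string) => Promise<string>,
): Promise<{ status: Status; result: Enriched | null; error: string | null }> {
  let raw: string;
  try {
    raw = await callModel(prompt);
  } catch (err) {
    // Transport or API errors are left to the queue's retry logic.
    return { status: 'failed', result: null, error: String(err) };
  }
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { status: 'needs_review', result: null, error: 'invalid_json' };
  }
  if (!isEnriched(parsed)) {
    // Never write a half-valid result into the dataset.
    return { status: 'needs_review', result: null, error: 'schema_mismatch' };
  }
  return { status: 'completed', result: parsed, error: null };
}
```

The point of the design is that only a fully validated object ever reaches the `completed` state; everything else is either retried or routed to a human.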

Choosing Claude API for Bulk Processing

We evaluated three options before settling on Claude (specifically Claude 3.5 Sonnet, and later Claude 3.5 Haiku for simpler tasks):

| Feature | Claude 3.5 Sonnet | GPT-4o | Gemini 1.5 Pro |
| --- | --- | --- | --- |
| Input cost (per 1M tokens) | $3.00 | $2.50 | $1.25 |
| Output cost (per 1M tokens) | $15.00 | $10.00 | $5.00 |
| Rate limits (RPM, Tier 2) | 1,000 | 500 | 360 |
| JSON mode reliability | Excellent | Good | Inconsistent |
| Structured output quality | Best in class | Very good | Good |
| Batch API discount | 50% | 50% | N/A |

Prices as of Q1 2025. Check current pricing -- these change frequently.

We went with Claude for a few reasons. First, its instruction-following for structured output was noticeably better than the alternatives during our 500-record test run. When you're processing nearly 29K records, even a 2% improvement in format compliance saves you hundreds of manual corrections. Second, Anthropic's Batch API offered a 50% discount for non-time-sensitive work, which made the economics even more favorable.

Honestly, GPT-4o would have been fine too. The differences at this scale are more about rate limits and pricing than raw quality. But Claude's consistency with JSON output was the deciding factor.

Why We Used Both Sonnet and Haiku

Here's a trick that saved us about 40% on API costs: we didn't use the same model for everything. Product descriptions needed Sonnet's quality. But category classification and attribute extraction? Haiku handled those just fine at a fraction of the cost.

We split the enrichment into two passes:

  1. Pass 1 (Haiku): Category classification, attribute extraction, basic metadata -- $0.80/1M input, $4.00/1M output
  2. Pass 2 (Sonnet): Description rewriting, meta descriptions, SEO content -- $3.00/1M input, $15.00/1M output
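
If you want to estimate what a pass will cost before running it, the per-request arithmetic is simple. Here is a small sketch using the Sonnet list prices from the table above; the function name, the `discount` parameter, and the token counts in the example are made up for illustration.

```typescript
interface ModelPrice {
  inputPerMTok: number;  // USD per 1M input tokens
  outputPerMTok: number; // USD per 1M output tokens
}

// Sonnet list prices as quoted in the comparison table.
const SONNET: ModelPrice = { inputPerMTok: 3.0, outputPerMTok: 15.0 };

// Cost in USD of one request; `discount` is 0.5 for Batch API work, 0 otherwise.
function requestCostUsd(
  price: ModelPrice,
  inputTokens: number,
  outputTokens: number,
  discount = 0,
): number {
  const full =
    (inputTokens * price.inputPerMTok + outputTokens * price.outputPerMTok) / 1000000;
  return full * (1 - discount);
}

// Example with made-up token counts: 1,000 in, 500 out.
const realtimeCost = requestCostUsd(SONNET, 1000, 500);     // about 0.0105
const batchedCost = requestCostUsd(SONNET, 1000, 500, 0.5); // about 0.00525
```

Note how output tokens dominate: in this example they are a third of the tokens but over 70% of the cost, which is why the long-form description pass is the expensive one.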

Prompt Engineering at Scale

This is where most tutorials fail you. They show you a single prompt and call it a day. When you're running 28,840 records through the same prompt template, tiny flaws get amplified into massive problems.

The Prompt Template

After about 15 iterations (yes, we tracked them in git), here's the rough structure that worked:

const buildPrompt = (record: SourceRecord): string => `
You are enriching product data for an e-commerce catalog. Generate the following for the product below:

1. A product description (150-300 words, second person, benefit-focused)
2. Exactly 3 category tags from this allowed list: ${CATEGORY_LIST}
3. A meta description (120-155 characters)
4. Structured attributes as key-value pairs

Rules:
- Do NOT invent features not present in the source data
- If information is ambiguous, use the "uncertain" flag
- Match the brand's tone: professional but approachable
- Description must be unique -- do not repeat the title verbatim in the first sentence

Respond ONLY with valid JSON matching this schema:
${JSON_SCHEMA}

Source product data:
Title: ${record.title}
Existing description: ${record.description}
Raw attributes: ${record.attributes}
Price: ${record.price}
Brand: ${record.brand}
`;

Lessons on Prompts at Scale

Be absurdly specific about output format. We included the full JSON schema in every request. Yes, it adds tokens. No, don't skip it. The one time we tried relying on system instructions alone, our format compliance dropped from 97% to 81%.

Constrain the output vocabulary. For category tags, we provided an explicit allowed list. Open-ended categorization produced 847 unique categories across our test batch. The constrained version? Exactly the 42 we wanted.

Add guardrails for hallucination. Products would occasionally sprout features they didn't have. Adding "Do NOT invent features not present in the source data" reduced hallucinated attributes by about 70%. Adding the uncertain flag caught most of the remaining cases.

Temperature matters more than you think. We settled on 0.3. Lower than that and descriptions got repetitive across similar products. Higher and we started getting creative writing that didn't match the brand voice.
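
The hallucination guardrail above is easier to trust when it is backed by an automated check. Here is a minimal sketch of one such check, flagging attribute values that never appear in the source text. The helper `ungroundedAttributes` is hypothetical and deliberately naive (a case-insensitive substring match), so treat its hits as review flags rather than proof of a hallucination.

```typescript
// Hypothetical check: returns the attribute keys whose values cannot be found
// anywhere in the source text. It will miss paraphrases and unit conversions.
function ungroundedAttributes(
  attributes: Record<string, string>,
  sourceText: string,
): string[] {
  const haystack = sourceText.toLowerCase();
  return Object.entries(attributes)
    .filter(([, value]) => {
      const needle = value.trim().toLowerCase();
      // Empty values carry no claim, so they are not flagged.
      return needle !== '' && !haystack.includes(needle);
    })
    .map(([key]) => key);
}

const flagged = ungroundedAttributes(
  { material: 'Cotton', waterproof: 'Yes' },
  'Classic cotton t-shirt, machine washable',
);
// flagged is ['waterproof']: nothing in the source supports that attribute.
```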

Rate Limits, Retries, and the Art of Not Getting Banned

This section should really be called "the part that took the most engineering time." Anthropic's rate limits are well-documented but behave differently under sustained load than you'd expect from reading the docs.

Our Rate Limit Strategy

At Tier 2 (which you get after spending $40+), Claude gives you 1,000 requests per minute and 80,000 tokens per minute. Sounds generous until you realize our average request was about 1,200 input tokens and 800 output tokens, or roughly 2,000 tokens in total. That meant our practical limit was about 40 requests per minute (80,000 / 2,000) before hitting token limits, far below the 1,000 RPM ceiling.
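
The arithmetic behind that ceiling fits in a few lines. This sketch assumes, as the paragraph above does, that input and output tokens count against a single per-minute budget; the function name is illustrative.

```typescript
// Requests per minute allowed by whichever limit binds first.
function maxRequestsPerMinute(
  rpmLimit: number,
  tokensPerMinuteLimit: number,
  avgTokensPerRequest: number,
): number {
  const tokenBound = Math.floor(tokensPerMinuteLimit / avgTokensPerRequest);
  return Math.min(rpmLimit, tokenBound);
}

// Figures from the paragraph above: 1,000 RPM, 80,000 TPM, 1,200 + 800 tokens.
const ceiling = maxRequestsPerMinute(1000, 80000, 1200 + 800); // 40
```

With requests this large, the token budget binds long before the request-count limit does.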

We ran 5 concurrent workers, each processing one record at a time, with a 200ms delay between requests. This gave us roughly 15-20 requests per minute -- well under the RPM limit and comfortably within token budgets.

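import Bottleneck from 'bottleneck';

const rateLimiter = new Bottleneck({
  maxConcurrent: 5, // one in-flight request per worker
  minTime: 200, // ms between requests
  reservoir: 900, // requests per minute (leaving buffer)
  reservoirRefreshAmount: 900,
  reservoirRefreshInterval: 60 * 1000, // refill the reservoir every minute
});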

Why so conservative? Because hitting rate limits causes cascading failures. One 429 response triggers a retry, which adds to the queue, which increases concurrency pressure. We learned this the hard way during hour 3 of our first real run, when aggressive settings caused a retry storm that effectively stalled the pipeline for 45 minutes.
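
One common way to keep retries from piling up in lockstep is exponential backoff with random jitter. The sketch below shows that technique in general terms; it is not a record of what our pipeline ran, and the names, the 30-second base, and the 5-minute cap are illustrative assumptions.

```typescript
// Full-jitter backoff: wait a random time in [0, min(cap, base * 2^attempt)).
function backoffWithJitterMs(
  attempt: number, // 0 for the first retry
  baseMs: number,
  capMs: number,
  random: () => number = Math.random,
): number {
  const ceilingMs = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(random() * ceilingMs);
}

// Retries `task` only when `isRateLimited` says the error was a 429-style rejection.
// `sleep` is injected so the function can be tested without real waiting.
async function withRetry<T>(
  task: () => Promise<T>,
  isRateLimited: (err: unknown) => boolean,
  maxRetries: number,
  sleep: (ms: number) => Promise<void>,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task();
    } catch (err) {
      // Give up on non-rate-limit errors or once the retry budget is spent.
      if (attempt >= maxRetries || !isRateLimited(err)) throw err;
      await sleep(backoffWithJitterMs(attempt, 30000, 300000));
    }
  }
}
```

Because each worker waits a different random interval, five workers that hit a 429 at the same moment do not all come back at the same moment.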

The Batch API Alternative

Halfway through the project, we switched partially to Anthropic's Batch API. Instead of making individual requests, you submit a batch of requests, each tagged with a custom ID, and get the results back as a JSONL file within 24 hours. The tradeoff: 50% cost reduction, but you lose real-time feedback.

We used the Batch API for Pass 1 (Haiku classification) and real-time API for Pass 2 (Sonnet descriptions). This hybrid approach was the sweet spot for us -- fast feedback on the expensive creative work, batch economics on the commodity classification.
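
For a sense of what the batch side looks like, here is a rough sketch of turning records into batch entries. The `custom_id` plus `params` shape reflects our understanding of Anthropic's Message Batches format, so check the current docs before relying on it; the function name, the `rec-` prefix, the model ID default, and `max_tokens` are placeholders.

```typescript
interface BatchRequest {
  custom_id: string;
  params: {
    model: string;
    max_tokens: number;
    messages: { role: 'user'; content: string }[];
  };
}

// Builds one batch entry per record. Model ID and maxTokens are placeholder defaults.
function toBatchRequests(
  records: { id: string; prompt: string }[],
  model = 'claude-3-5-haiku-20241022',
  maxTokens = 1024,
): BatchRequest[] {
  return records.map((r): BatchRequest => ({
    // custom_id lets you match each result back to its source record,
    // since batch results are not guaranteed to come back in order.
    custom_id: `rec-${r.id}`,
    params: {
      model,
      max_tokens: maxTokens,
      messages: [{ role: 'user', content: r.prompt }],
    },
  }));
}
```

Keying everything on the record ID is what makes the hybrid approach workable: batch results can be written back into the same SQLite rows that the real-time pass uses.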

Quality Control: The Human-in-the-Loop Reality

Anyone who tells you AI enrichment is fully automated is either lying or hasn't done it at scale. We built a QA process that caught problems early and prevented garbage from making it into production.

Automated Validation

Every API response went through validation before being accepted:

const validateEnrichment = (result: EnrichmentResult): ValidationOutcome => {
  const issues: string[] = [];
  
  // Length checks
  if (result.description.length < 400 || result.description.length > 2000) {
    issues.push('description_length');
  }
  
  // Category validation
  const invalidCats = result.categories.filter(c => !ALLOWED_CATEGORIES.includes(c));
  if (invalidCats.length > 0) issues.push('invalid_categories');
  
  // Meta description length
  if (result.meta.length > 160) issues.push('meta_too_long');
  
  // Hallucination signals
  const hallucination_phrases = ['I think', 'probably', 'might be', 'as an AI'];
  if (hallucination_phrases.some(p => result.description.includes(p))) {
    issues.push('possible_hallucination');
  }
  
  // Duplicate detection (fuzzy match against already-processed records)
  if (isDuplicateDescription(result.description)) {
    issues.push('duplicate_content');
  }
  
  return {
    valid: issues.length === 0,
    issues,
    needsReview: issues.length > 0 && issues.length < 3,
    rejected: issues.length >= 3,
  };
};

Manual Review Sampling

We sampled 5% of all processed records (about 1,440) for manual review. Our QA team scored each on accuracy, brand voice, and completeness. Here are the numbers from our actual review:

| Metric | Score |
| --- | --- |
| Factual accuracy | 94.2% |
| Brand voice match | 87.6% |
| Format compliance | 97.1% |
| Category accuracy | 91.8% |
| Records needing revision | 8.3% |
| Records completely rejected | 1.9% |

That 8.3% needing revision is important context. It means about 2,400 records needed human editing. Still way less than manually writing all 28,840 -- but it's not zero. Budget for it.
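
The headcount behind those percentages is a one-line extrapolation. This sketch assumes the rates measured on the 5% sample carry over to the whole catalog, which is only as true as the sample is representative; the function name is illustrative.

```typescript
// Expected number of records at a given rate, rounded to whole records.
function expectedCount(totalRecords: number, rate: number): number {
  return Math.round(totalRecords * rate);
}

const totalRecords = 28840;
const sampleSize = expectedCount(totalRecords, 0.05);   // 1442
const revisions = expectedCount(totalRecords, 0.083);   // 2394
const rejections = expectedCount(totalRecords, 0.019);  // 548
```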

Real Cost Breakdown

Transparency time. Here's what we actually spent:

| Cost Category | Amount |
| --- | --- |
| Claude 3.5 Haiku (Pass 1 - Batch API) | $312 |
| Claude 3.5 Sonnet (Pass 2 - Real-time) | $2,147 |
| Failed/retry requests (~6% overhead) | $189 |
| Redis hosting (2 weeks) | $15 |
| Engineering time (80 hrs × $150) | $12,000 |
| QA review time (40 hrs × $45) | $1,800 |
| Total | $16,463 |
| API costs only | $2,648 |

Compare that to the $134,575 estimate for fully manual work. Even including all engineering and QA time, we're at about 12% of the manual cost. And the pipeline is reusable -- the next time we run a similar project, the engineering cost drops to near zero.

The per-record API cost worked out to about $0.092. Under a dime per record for AI enrichment. That's the number that makes executives sit up in their chairs.
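
If you want to rerun this math with your own numbers, the figures above can be re-derived in a few lines. The line items are the ones from the cost table; only the variable names are ours.

```typescript
const sum = (values: number[]): number => values.reduce((a, b) => a + b, 0);

// Line items from the cost table, in USD.
const apiLineItems = [312, 2147, 189];          // Haiku, Sonnet, retry overhead
const otherLineItems = [15, 80 * 150, 40 * 45]; // Redis, engineering, QA

const apiTotal = sum(apiLineItems);                  // 2648
const projectTotal = apiTotal + sum(otherLineItems); // 16463
const perRecordApiCost = apiTotal / 28840;           // about 0.0918
const shareOfManualCost = projectTotal / 134575;     // about 0.122
```

Swapping in your own hourly rates changes the picture quickly: people time is about 84% of the total here, so the API bill is the smallest lever you have.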

What We Got Wrong

Underestimating Data Cleaning

We spent 3 days just cleaning the source data before sending it to Claude. Records had HTML entities, Unicode garbage, truncated descriptions, and fields in the wrong columns. Garbage in, garbage out isn't just a cliché -- it's the fundamental law of bulk AI processing.

Not Using the Batch API from Day One

We burned about $400 extra in API costs by running Pass 1 through the real-time API before discovering the Batch API would've been half the price. Read the full documentation before you start. All of it.

Insufficient Duplicate Detection

Our initial duplicate detection was too naive -- simple string matching. Claude would generate descriptions that were structurally identical but used slightly different words for similar products. We had to implement semantic similarity checking (using embeddings) to catch these, which added a day of work.
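
The core of embedding-based dedup is a similarity comparison between vectors. Here is a minimal sketch, assuming you already have an embedding for each description from whatever model you use. The function names and the 0.95 threshold are illustrative guesses, not the values we shipped.

```typescript
// Cosine similarity between two equal-length embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction, so treat it as similar to nothing.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// True if `candidate` is at least `threshold`-similar to any accepted embedding.
function isNearDuplicate(
  candidate: number[],
  accepted: number[][],
  threshold = 0.95,
): boolean {
  return accepted.some((e) => cosineSimilarity(candidate, e) >= threshold);
}
```

This is a linear scan, so each new record is compared against every accepted one. Comparing only within the same category, or using a vector index, is the obvious optimization if the scan gets slow.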

JSON Parsing Failures

About 2.4% of responses came back with malformed JSON. Sometimes a trailing comma, sometimes an unescaped quote in a product description. We should have implemented a more forgiving JSON parser from the start instead of treating these as hard failures.

// What we should have done from day one
import { jsonrepair } from 'jsonrepair';

const parseResponse = (raw: string): EnrichmentResult | null => {
  try {
    return JSON.parse(raw);
  } catch {
    // Try to extract JSON from a markdown code block (with or without a "json" tag)
    const jsonMatch = raw.match(/```(?:json)?\s*([\s\S]*?)\s*```/);
    if (jsonMatch) {
      try { return JSON.parse(jsonMatch[1]); } catch { /* fall through */ }
    }
    // Try jsonrepair library as last resort
    try { return JSON.parse(jsonrepair(raw)); } catch { return null; }
  }
};

What We'd Do Differently Next Time

  1. Start with a 1,000-record pilot before committing to the full run. We did 500 and it wasn't enough to surface all the edge cases.

  2. Use structured outputs from the start. Anthropic now supports tool use with defined schemas, which eliminates most JSON parsing issues. We migrated to this halfway through and wish we'd started there.

  3. Build the QA dashboard first. We built it reactively after problems appeared. Having it from day one would've caught issues in the first 100 records instead of the first 2,000.

  4. Invest in better embeddings for dedup. We'd use a dedicated embedding model (like text-embedding-3-small) from the start for semantic duplicate detection.

  5. Consider hybrid model routing. Some records are simple (t-shirts with basic attributes) and some are complex (electronics with dozens of specs). Routing simple records to Haiku and complex ones to Sonnet -- even for descriptions -- would've saved another 20-30% on API costs.
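
To illustrate item 2, here is a rough sketch of what a tool definition for structured output can look like, plus a helper that pulls the arguments out of the response. The `name` / `description` / `input_schema` fields and the `tool_use` content block follow Anthropic's tool-use format as we understand it; the tool name `record_enrichment` and the schema details are hypothetical, and we'd still validate the result rather than assume constraints like `minItems` are always honored.

```typescript
// Tool definition: the model is asked to "call" this tool, so its arguments
// arrive as an object shaped by input_schema instead of free-form JSON text.
const enrichmentTool = {
  name: 'record_enrichment', // hypothetical tool name
  description: 'Record the enriched fields for one product.',
  input_schema: {
    type: 'object',
    properties: {
      description: { type: 'string' },
      categories: { type: 'array', items: { type: 'string' }, minItems: 3, maxItems: 3 },
      meta: { type: 'string' },
      attributes: { type: 'object', additionalProperties: { type: 'string' } },
    },
    required: ['description', 'categories', 'meta', 'attributes'],
  },
} as const;

// Request-body fragment that forces the model to use the tool.
const toolRequestFragment = {
  tools: [enrichmentTool],
  tool_choice: { type: 'tool', name: enrichmentTool.name },
} as const;

type ContentBlock =
  | { type: 'text'; text: string }
  | { type: 'tool_use'; id: string; name: string; input: unknown };

// Pulls the tool arguments out of a response's content blocks, or null if absent.
function extractToolInput(content: ContentBlock[], toolName: string): unknown {
  for (const block of content) {
    if (block.type === 'tool_use' && block.name === toolName) return block.input;
  }
  return null;
}
```

Because the arguments arrive already parsed, the trailing-comma and unescaped-quote failures from the previous section mostly disappear; what remains is checking the values themselves.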

If you're planning a similar project and want to skip the painful parts, we've built reusable pipelines for this kind of work as part of our headless CMS development practice. Happy to share more specifics.

FAQ

How long does it take to enrich 28,000+ records with AI?

The whole project took about 11 days, including pipeline development, testing, processing, and QA review. The API processing itself (sending requests and getting responses) took roughly 48 hours of continuous running. If you're using the Batch API exclusively, expect 24-48 hours for processing plus 3-5 days for engineering and QA.

What's the cost per record for AI content enrichment?

Using Claude 3.5 Sonnet and Haiku in combination, our API cost was approximately $0.092 per record for generating a product description, categories, meta description, and structured attributes. Your mileage will vary based on input/output lengths and which model you choose. Batch API processing cuts this roughly in half.

Is Claude or GPT-4 better for bulk data enrichment?

Both work well. We chose Claude 3.5 Sonnet because of its superior JSON format compliance during our testing (97.1% vs ~94% for GPT-4o). However, GPT-4o is slightly cheaper for output tokens. If your enrichment is primarily classification rather than content generation, the difference is negligible. Test both with 500 records before committing.

How do you handle rate limits when making thousands of API calls?

Use a rate limiter library like Bottleneck, set conservative concurrency (5-10 parallel requests), implement exponential backoff for retries, and leave a 10-15% buffer below published rate limits. For non-time-sensitive work, Anthropic's Batch API largely sidesteps per-minute rate limit concerns and costs 50% less.

What percentage of AI-enriched records need human review?

In our project, 8.3% of records needed some form of human editing and 1.9% were completely rejected and rewritten manually. Your numbers will depend on data quality, prompt engineering, and acceptable quality thresholds. Plan for 5-15% human intervention as a realistic baseline.

Can AI bulk enrichment handle multiple languages?

Yes, but quality varies significantly by language. Claude and GPT-4 handle major European languages well, but accuracy drops for less common languages. We recommend running separate prompt templates per language and having native speakers in your QA sample. Expect the human review percentage to roughly double for non-English content.

How do you prevent AI hallucinations in product data?

Three layers: prompt instructions explicitly forbidding invented features, an "uncertain" flag for ambiguous data, and automated validation comparing enriched attributes against source data. We also used semantic similarity scoring to flag descriptions that diverged too far from the original product information. The prompt instruction alone reduced hallucinated attributes by approximately 70%, and the other layers caught most of the rest.

Is it worth building a custom pipeline or should I use an existing tool?

For under 1,000 records, tools like Clay, Bardeen, or even a well-structured Google Sheets + Apps Script setup can work. Beyond that, a custom pipeline pays for itself quickly. The control over retry logic, quality validation, and cost optimization that a custom solution provides is essential at scale. Our pipeline was roughly 2,000 lines of TypeScript -- not trivial, but not a massive project either. Check our pricing page if you'd like us to build one for your use case.