Make Your Content AI-Ready Without Migrating to Sanity
There's a narrative floating around the CMS world right now that goes something like this: "If you want AI-ready content, you need Sanity's structured content approach." And look, Sanity's content lake and their GROQ-powered AI integrations are genuinely impressive. But here's the thing -- most teams can't just abandon their existing CMS. You've got years of content in WordPress. Your app's data layer lives in Supabase. You just finished migrating to Payload CMS six months ago. The idea of another migration makes your stomach turn.
Good news: you don't need to switch. You need to think differently about how your content is structured, stored, and exposed. I've spent the last year helping teams retrofit their existing stacks for AI consumption, and the patterns are surprisingly consistent regardless of which CMS or database you're running. Let me walk you through it.
Table of Contents
- What "AI-Ready Content" Actually Means
- Why Sanity Gets All the Attention
- Structuring Content for AI in WordPress
- Payload CMS: You're Closer Than You Think
- Supabase as an AI-Ready Content Layer
- The Universal Principles of AI-Ready Content
- Building an AI Abstraction Layer
- Vector Embeddings Without a Full Migration
- Real-World Architecture Patterns
- FAQ
What "AI-Ready Content" Actually Means
Before we get tactical, let's clarify what we're actually talking about. "AI-ready content" isn't a marketing buzzword (well, it is, but there's substance underneath). It means your content meets three criteria:
- Machine-parseable structure -- AI models can reliably extract meaning from your content without guessing at context
- Rich metadata -- Every piece of content carries enough semantic information that an AI can understand relationships, intent, and context
- API accessibility -- Content is available through programmatic interfaces that AI agents, RAG pipelines, and LLM tool-calling can consume
That's it. Notice what's not on the list: a specific vendor. These are architectural patterns, not product features.
The Content Intelligence Spectrum
Think of content AI-readiness on a spectrum:
| Level | Description | Example |
|---|---|---|
| 0 | Blob of HTML | WordPress post with inline styles and mixed media |
| 1 | Separated concerns | Clean HTML with structured data markup |
| 2 | Field-level structure | Content broken into typed fields (title, summary, body, author) |
| 3 | Semantic relationships | Content with explicit references, taxonomies, and entity links |
| 4 | AI-native | Content with embeddings, semantic annotations, and machine-readable intent |
Sanity's structured content model nudges you toward Level 3-4 by default. But every CMS can reach Level 3, and with some additional infrastructure, Level 4.
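If you want a rough way to audit where your existing content sits on that spectrum, something like the sketch below can help. The `ContentSample` shape and the scoring rules are my own simplification for illustration, not a standard, so treat the output as a conversation starter rather than a measurement.

```typescript
// Hypothetical shape: these field names are illustrative, not from any CMS API.
interface ContentSample {
  html?: string;                    // raw rendered markup, if that is all you have
  hasInlineStyles?: boolean;        // true when presentation is baked into the markup
  fields?: Record<string, unknown>; // typed fields such as title, summary, body
  references?: string[];            // ids of explicitly linked documents
  embedding?: number[];             // vector embedding, if one has been generated
}

// Maps a content item to the 0-4 levels from the table above.
function estimateReadinessLevel(item: ContentSample): 0 | 1 | 2 | 3 | 4 {
  const hasFields = item.fields !== undefined && Object.keys(item.fields).length > 0;
  const hasReferences = (item.references?.length ?? 0) > 0;
  const hasEmbedding = (item.embedding?.length ?? 0) > 0;

  if (hasFields && hasReferences && hasEmbedding) return 4;
  if (hasFields && hasReferences) return 3;
  if (hasFields) return 2;
  if (item.html !== undefined && !item.hasInlineStyles) return 1;
  return 0;
}
```

Running this over a sample of a few hundred posts is usually enough to see which content types need work first.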
Why Sanity Gets All the Attention
Let's give credit where it's due. Sanity's approach to structured content is genuinely well-designed for AI use cases:
- Portable Text stores rich text as a JSON AST rather than HTML, making it trivial to parse programmatically
- GROQ queries return exactly the shape of data you need, which keeps what you feed into an LLM context window small and focused
- Content Lake treats content as a graph of typed documents with explicit references
- Their AI SDK integrations in 2025 allow direct tool-calling from LLMs into content queries
But here's what the Sanity evangelists don't mention: these advantages are architectural patterns, not proprietary magic. You can implement every single one of these in your existing stack. It just takes intentional design.
The real question isn't "should I switch to Sanity?" It's "how do I apply structured content principles where I already am?"
Structuring Content for AI in WordPress
WordPress powers something like 43% of the web in 2025. If you're running WordPress, you're in good company, and you've got more options than you might think.
Step 1: Stop Using the Classic Editor for Everything
The Gutenberg block editor already stores content as structured blocks. Each block has a type, attributes, and content. This is closer to Sanity's Portable Text than most people realize.
{
"blockName": "core/paragraph",
"attrs": {},
"innerBlocks": [],
"innerHTML": "<p>This is structured content, not just HTML.</p>",
"innerContent": ["<p>This is structured content, not just HTML.</p>"]
}
The block data is stored as serialized comments in post_content, but you can parse it programmatically:
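// Parse the serialized block comments into an array of block arrays
$blocks = parse_blocks($post->post_content);
// Drop the null-named entries that hold whitespace between real blocks
$blocks = array_filter($blocks, fn($b) => $b['blockName'] !== null);
// array_values() re-indexes the result so json_encode() emits a list, not an object
$structured = array_values(array_map(function($block) {
    return [
        'type' => $block['blockName'],
        'attributes' => $block['attrs'],
        'content' => trim(strip_tags($block['innerHTML'])),
    ];
}, $blocks));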
Step 2: Invest in Custom Fields and Taxonomies
Advanced Custom Fields (ACF) or Meta Box give you Level 2-3 content structure. But you need to be intentional about it. Don't just add fields -- design a content model.
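// Register a structured content type for AI consumption
add_action('init', function() {
    register_post_type('knowledge_article', [
        // 'editor' keeps the block editor available for this post type
        'supports' => ['title', 'editor', 'custom-fields'],
        'show_in_rest' => true, // Critical for API access
    ]);
});
// Define semantic fields (assumes a 'concept' taxonomy is registered elsewhere)
add_action('acf/init', function() {
    acf_add_local_field_group([
        'key' => 'group_ai_ready_content',
        'title' => 'AI-Ready Content Fields',
        'fields' => [
            ['key' => 'field_summary', 'name' => 'summary', 'label' => 'Summary', 'type' => 'textarea'],
            // save_terms writes real term relationships so wp_get_post_terms() can read them
            ['key' => 'field_key_concepts', 'name' => 'key_concepts', 'label' => 'Key Concepts', 'type' => 'taxonomy', 'taxonomy' => 'concept', 'save_terms' => 1],
            ['key' => 'field_content_intent', 'name' => 'content_intent', 'label' => 'Content Intent', 'type' => 'select', 'choices' => [
                'informational' => 'Informational',
                'transactional' => 'Transactional',
                'navigational' => 'Navigational',
            ]],
            ['key' => 'field_related_entities', 'name' => 'related_entities', 'label' => 'Related Entities', 'type' => 'relationship'],
        ],
        // Attach the group to the post type registered above
        'location' => [[['param' => 'post_type', 'operator' => '==', 'value' => 'knowledge_article']]],
    ]);
});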
Step 3: Expose Everything Through the REST API
The WordPress REST API is your bridge to AI. Make sure custom fields are exposed:
add_action('rest_api_init', function() {
register_rest_field('knowledge_article', 'ai_metadata', [
'get_callback' => function($post) {
return [
'summary' => get_field('summary', $post['id']),
'concepts' => wp_get_post_terms($post['id'], 'concept', ['fields' => 'names']),
'intent' => get_field('content_intent', $post['id']),
'related' => get_field('related_entities', $post['id']),
'structured_blocks' => parse_blocks(get_post_field('post_content', $post['id'])),
];
},
]);
});
If you're running WordPress as a headless CMS with a Next.js or Astro frontend (which is something we do a lot at Social Animal), this REST API becomes your AI's primary interface.
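On the consuming side, the `structured_blocks` array still contains HTML fragments and nested container blocks, so most pipelines flatten it before embedding. Here's a minimal sketch of that step. The `WpBlock` interface mirrors the fields `parse_blocks()` returns; `flattenBlocks` and `stripTags` are hypothetical helper names, and the regex-based tag stripping is deliberately naive (it does not decode HTML entities).

```typescript
// Subset of the block shape returned by parse_blocks() and serialized over REST.
interface WpBlock {
  blockName: string | null;
  attrs: Record<string, unknown> | unknown[];
  innerBlocks: WpBlock[];
  innerHTML: string;
}

interface FlatBlock {
  type: string;
  text: string;
}

// Naive tag stripper: fine for block markup, not a general HTML sanitizer.
function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

// Walks the block tree and returns one plain-text entry per leaf block.
function flattenBlocks(blocks: WpBlock[]): FlatBlock[] {
  const out: FlatBlock[] = [];
  for (const block of blocks) {
    if (block.blockName === null) continue; // whitespace between real blocks
    if (block.innerBlocks.length > 0) {
      // Container blocks (groups, columns): descend into the children
      out.push(...flattenBlocks(block.innerBlocks));
    } else {
      const text = stripTags(block.innerHTML);
      if (text.length > 0) out.push({ type: block.blockName, text });
    }
  }
  return out;
}
```

Keeping the block type next to each text fragment lets you weight headings differently from paragraphs later, or drop blocks like buttons entirely.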
Step 4: Add JSON-LD Structured Data
This one's often overlooked for AI readiness, but it matters. Google's AI Overviews and other AI-driven crawlers can draw on JSON-LD. Tools like Yoast SEO or Rank Math generate basic schema, but for real AI readiness, you want to output detailed structured data:
{
"@context": "https://schema.org",
"@type": "TechArticle",
"headline": "Make Your Content AI-Ready",
"abstract": "How to structure existing CMS content for AI consumption",
"about": [
{"@type": "Thing", "name": "Content Management"},
{"@type": "Thing", "name": "Artificial Intelligence"}
],
"mentions": [
{"@type": "SoftwareApplication", "name": "WordPress"},
{"@type": "SoftwareApplication", "name": "Payload CMS"}
]
}
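If your frontend is headless, you'll probably generate that JSON-LD from the same fields you expose over the API rather than hand-writing it. A minimal sketch, assuming a hypothetical `ArticleRecord` shape with `topics` and `tools` arrays:

```typescript
// Hypothetical record shape; map your own CMS fields onto it.
interface ArticleRecord {
  title: string;
  summary: string;
  topics: string[]; // becomes schema.org "about"
  tools: string[];  // becomes schema.org "mentions"
}

// Builds the JSON-LD payload as a string that is safe to inline in HTML.
function buildJsonLd(article: ArticleRecord): string {
  const doc = {
    '@context': 'https://schema.org',
    '@type': 'TechArticle',
    headline: article.title,
    abstract: article.summary,
    about: article.topics.map((name) => ({ '@type': 'Thing', name })),
    mentions: article.tools.map((name) => ({ '@type': 'SoftwareApplication', name })),
  };
  // Escape "<" so the JSON cannot close the surrounding <script> tag early
  return JSON.stringify(doc).replace(/</g, '\\u003c');
}

// Wraps the payload in the script tag crawlers look for.
function toScriptTag(article: ArticleRecord): string {
  return `<script type="application/ld+json">${buildJsonLd(article)}</script>`;
}
```

The escaping step matters more than it looks: titles and summaries are editor-controlled text, and an unescaped `</script>` inside one would break the page.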
Payload CMS: You're Closer Than You Think
If you're already on Payload CMS, congratulations -- you're probably at Level 2-3 without much extra work. Payload's collection-based architecture with typed fields is inherently structured.
Why Payload Is Already AI-Friendly
Payload stores content as typed JSON documents in MongoDB or Postgres. Every field has a defined type. Relationships are explicit. This is exactly what AI needs.
// Payload collection that's already AI-ready
import type { CollectionConfig } from 'payload';

const Articles: CollectionConfig = {
slug: 'articles',
fields: [
{ name: 'title', type: 'text', required: true },
{ name: 'summary', type: 'textarea' },
{ name: 'body', type: 'richText' }, // Stored as Slate/Lexical JSON
{ name: 'topics', type: 'relationship', relationTo: 'topics', hasMany: true },
{ name: 'contentType', type: 'select', options: ['guide', 'tutorial', 'reference'] },
],
};
Payload's rich text editor (Lexical in v3.x) stores content as a JSON AST -- just like Sanity's Portable Text. You already have structured content.
Adding AI-Specific Fields to Payload
The gap between Payload and full AI-readiness is mostly about metadata. Add these fields to your collections:
const aiFields: Field[] = [
{
name: 'aiMetadata',
type: 'group',
fields: [
{ name: 'embedding', type: 'json', admin: { hidden: true } },
{ name: 'extractedEntities', type: 'json', admin: { readOnly: true } },
{ name: 'semanticSummary', type: 'textarea', admin: { readOnly: true } },
{ name: 'contentHash', type: 'text', admin: { hidden: true } },
],
},
];
Then use Payload's hooks to auto-generate embeddings on save:
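// Assumes `openai` is an initialized OpenAI client, and that extractTextFromLexical
// and hashContent are your own helpers; none of them ship with Payload.
import type { CollectionAfterChangeHook } from 'payload';

const generateEmbeddingHook: CollectionAfterChangeHook = async ({ doc, req, context }) => {
  // The update below fires afterChange again; this flag stops the infinite loop
  if (context.skipEmbedding) return doc;

  const textContent = extractTextFromLexical(doc.body);
  const embedding = await openai.embeddings.create({
    model: 'text-embedding-3-small',
    input: `${doc.title}\n${doc.summary ?? ''}\n${textContent}`,
  });
  await req.payload.update({
    collection: 'articles',
    id: doc.id,
    data: {
      aiMetadata: {
        ...doc.aiMetadata,
        embedding: embedding.data[0].embedding,
        contentHash: hashContent(textContent),
      },
    },
    context: { skipEmbedding: true },
    req, // reuse the current request so the update joins the same transaction
  });
  return doc;
};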
This is essentially what Sanity's AI features do under the hood. You're just doing it yourself. For teams building on Payload with Next.js, this pattern integrates naturally into your existing deployment pipeline.
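The hook leans on an `extractTextFromLexical` helper that Payload doesn't provide. Here's one way it could look, as a sketch: it assumes the serialized Lexical shape of a `root` node whose descendants carry `children` arrays and whose text nodes carry a `text` string. Custom node types (uploads, embeds, blocks) are simply skipped.

```typescript
// Minimal view of a serialized Lexical node; real nodes carry more properties.
interface LexicalNode {
  type: string;
  text?: string;
  children?: LexicalNode[];
}

interface LexicalState {
  root: LexicalNode;
}

// Concatenates the text of an inline run, including text nested in links.
function inlineText(node: LexicalNode): string {
  if (typeof node.text === 'string') return node.text;
  return (node.children ?? []).map(inlineText).join('');
}

// Emits one line per text-bearing block (paragraph, heading, list item, ...).
function collectLines(node: LexicalNode, lines: string[]): void {
  const children = node.children ?? [];
  if (children.some((child) => typeof child.text === 'string')) {
    const line = inlineText(node).trim();
    if (line.length > 0) lines.push(line);
    return;
  }
  for (const child of children) collectLines(child, lines);
}

function extractTextFromLexical(state: LexicalState | null | undefined): string {
  if (!state) return '';
  const lines: string[] = [];
  collectLines(state.root, lines);
  return lines.join('\n');
}
```

One line per block keeps paragraph boundaries intact, which tends to matter if you later chunk long documents before embedding them.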
Supabase as an AI-Ready Content Layer
Supabase is interesting because it's not a CMS -- it's a database platform. But increasingly, teams use it as their content backend, especially now that it supports the pgvector Postgres extension for embeddings.
The pgvector Advantage
Supabase has had pgvector support since 2023, and it's matured significantly. This means you can store content AND vector embeddings in the same database:
-- Enable the extension
create extension if not exists vector;
-- Create a content table with embedding support
create table content (
id uuid default gen_random_uuid() primary key,
title text not null,
body text not null,
metadata jsonb default '{}',
content_type text not null,
embedding vector(1536), -- OpenAI text-embedding-3-small dimension
created_at timestamptz default now(),
updated_at timestamptz default now()
);
-- Create an index for similarity search
create index on content using ivfflat (embedding vector_cosine_ops)
with (lists = 100);
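A note on that `lists = 100` value: it suits a table of roughly 100k rows. The pgvector README suggests starting from rows / 1000 for up to 1M rows and sqrt(rows) beyond that, with `probes` at query time around sqrt(lists), and building the index after the table already holds data. A small helper to compute starting values, as a sketch (the function name is mine, and these are starting points to tune, not guarantees):

```typescript
// Starting-point heuristic for ivfflat, following the pgvector README guidance.
function suggestIvfflatParams(rowCount: number): { lists: number; probes: number } {
  const rawLists = rowCount <= 1000000 ? rowCount / 1000 : Math.sqrt(rowCount);
  const lists = Math.max(1, Math.round(rawLists));
  // More probes means better recall at the cost of slower queries
  const probes = Math.max(1, Math.round(Math.sqrt(lists)));
  return { lists, probes };
}

const params = suggestIvfflatParams(100000);
console.log(`with (lists = ${params.lists}); set ivfflat.probes = ${params.probes};`);
```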
Building a Content API for AI Agents
Supabase's auto-generated REST API plus Edge Functions give you everything you need:
// Supabase Edge Function for AI content retrieval
import { createClient } from 'jsr:@supabase/supabase-js@2';
Deno.serve(async (req) => {
const { query, limit = 5 } = await req.json();
const supabase = createClient(Deno.env.get('SUPABASE_URL')!, Deno.env.get('SUPABASE_KEY')!);
// Generate embedding for the query
const embeddingResponse = await fetch('https://api.openai.com/v1/embeddings', {
method: 'POST',
headers: {
'Authorization': `Bearer ${Deno.env.get('OPENAI_API_KEY')}`,
'Content-Type': 'application/json',
},
body: JSON.stringify({
model: 'text-embedding-3-small',
input: query,
}),
});
const { data } = await embeddingResponse.json();
const queryEmbedding = data[0].embedding;
// Semantic search using pgvector
const { data: results } = await supabase.rpc('match_content', {
query_embedding: queryEmbedding,
match_threshold: 0.7,
match_count: limit,
});
return new Response(JSON.stringify(results), {
headers: { 'Content-Type': 'application/json' },
});
});
The Postgres function for similarity matching:
create or replace function match_content(
query_embedding vector(1536),
match_threshold float,
match_count int
) returns table (
id uuid,
title text,
body text,
metadata jsonb,
similarity float
) language sql stable as $$
select
content.id,
content.title,
content.body,
content.metadata,
1 - (content.embedding <=> query_embedding) as similarity
from content
where 1 - (content.embedding <=> query_embedding) > match_threshold
order by content.embedding <=> query_embedding
limit match_count;
$$;
This gives you a fully functional RAG (Retrieval-Augmented Generation) backend without any CMS migration. Your content lives in Supabase, your AI can query it semantically, and your Astro or Next.js frontend can consume it through the same API.
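If the `1 - (embedding <=> query_embedding)` expression looks opaque: `<=>` is pgvector's cosine distance operator, so one minus that is cosine similarity. The sketch below does the same filtering and ordering in memory, which is handy for unit-testing your thresholds without a database. `matchContent` and the `Row` type are illustrative names, not part of any library.

```typescript
// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

interface Row {
  id: string;
  embedding: number[];
}

// In-memory equivalent of the match_content function: filter, sort, limit.
function matchContent(rows: Row[], query: number[], threshold: number, count: number) {
  return rows
    .map((row) => ({ id: row.id, similarity: cosineSimilarity(row.embedding, query) }))
    .filter((result) => result.similarity > threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, count);
}
```

A threshold like 0.7 is a starting point, not a constant of nature. Similarity distributions differ between embedding models, so it's worth checking against real queries.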
The Universal Principles of AI-Ready Content
Regardless of your CMS, these principles apply:
1. Separate Content from Presentation
This is the single biggest thing you can do. If your content is tangled with HTML, CSS classes, and layout concerns, AI can't reliably parse it. Store content as data, render it as HTML at the presentation layer.
2. Type Everything
Every field should have an explicit type. Don't use generic "text" fields for structured data. A date should be stored as a date. A reference should be a reference, not a slug string pasted into a text field.
3. Make Relationships Explicit
If Article A references Product B, that should be a typed relationship -- not a mention in the body text. AI tools need to traverse your content graph, and they can't do that with implied links.
4. Add Semantic Metadata
Go beyond basic SEO metadata. Include:
- Content intent (informational, transactional, navigational)
- Audience segment
- Confidence/freshness indicators
- Entity annotations
- Topic classifications beyond basic categories
5. Version and Timestamp Everything
AI systems need to know how fresh content is. Include created_at, updated_at, and ideally a valid_until or review_date field. Stale content in a RAG pipeline produces confidently wrong answers, because the model treats whatever it retrieves as current.
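Those timestamps only help if something actually checks them at retrieval time. A minimal sketch of a freshness gate, assuming camelCase `updatedAt` and `validUntil` ISO strings and a one-year default that you'd tune per content type:

```typescript
interface Timestamps {
  updatedAt: string;   // ISO 8601
  validUntil?: string; // ISO 8601, optional hard expiry
}

type Freshness = 'fresh' | 'stale' | 'expired';

// Classifies an item so the retrieval layer can drop or down-rank it.
function classifyFreshness(item: Timestamps, now: Date, maxAgeDays = 365): Freshness {
  if (item.validUntil !== undefined && new Date(item.validUntil).getTime() < now.getTime()) {
    return 'expired';
  }
  const msPerDay = 86400000;
  const ageDays = (now.getTime() - new Date(item.updatedAt).getTime()) / msPerDay;
  return ageDays > maxAgeDays ? 'stale' : 'fresh';
}
```

In practice I'd exclude `expired` items outright and pass the `stale` flag into the prompt, so the model can caveat its answer instead of presenting old information as current.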
Building an AI Abstraction Layer
Here's the pattern I keep coming back to: instead of migrating your CMS, add an AI abstraction layer on top of it.
[WordPress/Payload/Supabase] → [Content Sync] → [AI Layer (pgvector/Pinecone)] → [AI Consumers]
The AI layer:
- Syncs content from your CMS via webhooks or polling
- Normalizes it into a consistent structure regardless of source
- Generates embeddings and stores them alongside the normalized content
- Exposes an AI-optimized API for RAG, tool-calling, and semantic search
// Simplified content sync pipeline
interface NormalizedContent {
id: string;
source: 'wordpress' | 'payload' | 'supabase';
sourceId: string;
title: string;
body: string; // Plain text, stripped of markup
structuredBody: object; // JSON AST if available
metadata: {
type: string;
intent: string;
topics: string[];
entities: string[];
createdAt: string;
updatedAt: string;
};
embedding?: number[];
}
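// Assumed to exist elsewhere: one adapter per CMS, an embedding helper, and a vector store client
interface ContentSource {
  fetchAll(): Promise<unknown[]>;
  normalize(item: unknown): NormalizedContent;
}
declare function generateEmbedding(text: string): Promise<number[]>;
declare const aiLayer: { upsert(item: NormalizedContent): Promise<void> };

async function syncContent(source: ContentSource): Promise<void> {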
const rawContent = await source.fetchAll();
for (const item of rawContent) {
const normalized = source.normalize(item);
const embedding = await generateEmbedding(
`${normalized.title}\n${normalized.body}`
);
await aiLayer.upsert({
...normalized,
embedding,
});
}
}
This approach has a huge advantage: your editors keep using the CMS they know. No retraining, no migration, no downtime. The AI layer lives alongside your existing stack.
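Most of the real work in that pipeline hides inside `normalize`. As an illustration, here's roughly what a WordPress adapter could look like. The `title.rendered`, `content.rendered`, `date_gmt`, and `modified_gmt` fields are standard in the WP REST API post response; `ai_metadata` is the custom field registered earlier in this article; the `NormalizedWpContent` type is a trimmed-down stand-in for the interface above, and the tag stripping is naive (no entity decoding).

```typescript
// Subset of a WP REST API post, plus the custom ai_metadata field.
interface WpRestPost {
  id: number;
  type: string;
  title: { rendered: string };
  content: { rendered: string };
  date_gmt: string;     // e.g. "2025-01-02T03:04:05", no timezone suffix
  modified_gmt: string;
  ai_metadata?: { summary?: string; concepts?: string[]; intent?: string };
}

interface NormalizedWpContent {
  id: string;
  source: 'wordpress';
  sourceId: string;
  title: string;
  body: string;
  metadata: { type: string; intent: string; topics: string[]; createdAt: string; updatedAt: string };
}

function stripTags(html: string): string {
  return html.replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
}

function normalizeWpPost(post: WpRestPost): NormalizedWpContent {
  return {
    id: `wordpress:${post.id}`, // namespaced so ids from different sources cannot clash
    source: 'wordpress',
    sourceId: String(post.id),
    title: stripTags(post.title.rendered),
    body: stripTags(post.content.rendered),
    metadata: {
      type: post.type,
      intent: post.ai_metadata?.intent ?? 'unknown',
      topics: post.ai_metadata?.concepts ?? [],
      // The *_gmt fields are UTC without a suffix, so append "Z"
      createdAt: `${post.date_gmt}Z`,
      updatedAt: `${post.modified_gmt}Z`,
    },
  };
}
```

Writing one small adapter like this per source is what keeps everything downstream (embedding, search, tool-calling) completely unaware of which CMS a document came from.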
Vector Embeddings Without a Full Migration
Let's talk costs and tooling for 2025, because this matters for real-world decisions:
| Embedding Provider | Model | Cost per 1M tokens | Dimensions | Notes |
|---|---|---|---|---|
| OpenAI | text-embedding-3-small | $0.02 | 1536 | Best cost/quality ratio |
| OpenAI | text-embedding-3-large | $0.13 | 3072 | Higher accuracy |
| Cohere | embed-v4 | $0.10 | 1024 | Good multilingual support |
| Voyage AI | voyage-3 | $0.06 | 1024 | Strong for code content |
| Local (Ollama) | nomic-embed-text | Free | 768 | Privacy-first option |
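For a typical content site with 5,000 articles averaging 1,500 words each, you're looking at roughly 7.5M words, which works out to somewhere between 7.5M and 10M tokens depending on the tokenizer and language. With OpenAI's small model at $0.02 per 1M tokens, that's about $0.15 to $0.20 to embed your entire content library. Even re-embedding weekly is negligible.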
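To run the same arithmetic against your own numbers, a tiny estimator is enough. This is a sketch: the `tokensPerWord` default of 1.33 is the common rule of thumb for English text, not a measured value for your content, and prices change, so pass in the current rate.

```typescript
// Estimates the one-off cost of embedding a content library, in USD.
function estimateEmbeddingCostUsd(
  articleCount: number,
  avgWordsPerArticle: number,
  pricePerMillionTokens: number,
  tokensPerWord = 1.33, // rule-of-thumb assumption for English prose
): number {
  const totalTokens = articleCount * avgWordsPerArticle * tokensPerWord;
  return (totalTokens / 1000000) * pricePerMillionTokens;
}

// 5,000 articles x 1,500 words at $0.02 per 1M tokens
const optimistic = estimateEmbeddingCostUsd(5000, 1500, 0.02, 1); // one token per word
const typical = estimateEmbeddingCostUsd(5000, 1500, 0.02);       // ~1.33 tokens per word
console.log(optimistic.toFixed(2), typical.toFixed(2));
```

Either way the answer is measured in cents, which is the real point: embedding cost is almost never the constraint, so optimize for quality and freshness instead.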
Vector Storage Options
| Solution | Free Tier | Pricing (2025) | Best For |
|---|---|---|---|
| Supabase pgvector | 500MB database | $25/mo for 8GB | Teams already on Supabase |
| Pinecone | 5M vectors | $70/mo starter | Production RAG at scale |
| Qdrant Cloud | 1GB cluster | $25/mo | Advanced filtering needs |
| Weaviate Cloud | 50k objects | $25/mo | Multi-modal content |
| Turbopuffer | 1M vectors | Pay-per-query | Cost-sensitive projects |
If you're already running Supabase, pgvector is the obvious choice. No additional service, no additional billing, no additional point of failure.
Real-World Architecture Patterns
Let me share two architectures I've actually built:
Pattern 1: WordPress + Supabase AI Layer
For a media company with 50k+ WordPress posts:
- WordPress webhook fires on post save/update
- A Supabase Edge Function receives the webhook
- Content is fetched via WP REST API, normalized, and embedded
- Stored in Supabase with pgvector
- AI chatbot on the Next.js frontend queries Supabase for semantic search
- Results are passed to GPT-4o as context for answer generation
Total additional infrastructure cost: ~$25/month for Supabase pro tier.
Pattern 2: Payload CMS with Built-in AI
For a SaaS documentation site on Payload v3:
- Payload hooks generate embeddings on every document save
- Embeddings stored in a vector column in the same Postgres database Payload uses
- Custom Payload endpoint for semantic search
- AI docs assistant powered by the same database
- No external vector store needed
Total additional infrastructure cost: $0 beyond the OpenAI API calls (pennies per month).
Both patterns took about 2-3 weeks to implement, compared to the 3-6 months a full CMS migration would take. If you're considering this kind of architecture, we've got pricing tiers that cover exactly these types of projects.
FAQ
Do I really need to restructure my content for AI, or is it just hype?
It's not hype, but the urgency depends on your use case. If you're building AI features (chatbots, semantic search, personalization), structured content is essential. If you're optimizing for AI-driven search like Google's AI Overviews or ChatGPT's browsing, structured data and clean content hierarchies measurably improve your visibility. A 2025 study by Authoritas found that pages with schema markup were 40% more likely to appear in AI-generated answers.
What's the minimum I should do to make WordPress content AI-ready?
Three things: (1) Use Gutenberg blocks consistently instead of pasting HTML, (2) add JSON-LD structured data to every page, and (3) expose custom fields through the REST API. This gets you from Level 0-1 to Level 2-3 in a few weeks of focused work. You don't need to restructure your entire site overnight.
Can Payload CMS replace Sanity for AI-powered content?
For most use cases, yes. Payload v3 with Lexical rich text stores content as structured JSON, has typed fields and relationships, and supports Postgres with pgvector. The main thing Sanity offers that Payload doesn't have natively is the managed Content Lake with built-in AI features. But if you're willing to wire up your own embedding pipeline (which takes maybe a day), Payload gives you equivalent capabilities.
How much does it cost to add vector embeddings to an existing CMS?
Surprisingly little. For a site with 10,000 articles, initial embedding generation with OpenAI's text-embedding-3-small costs about $0.30. Ongoing costs for re-embedding updated content are typically under $5/month. The vector storage is the bigger cost -- expect $0-70/month depending on your provider and scale. Supabase's free tier can handle many small-to-medium sites.
Should I use a separate vector database or store embeddings in my existing database?
If you're on Postgres (which Payload v3 and Supabase both use), store embeddings in the same database with pgvector. One less service to manage, one less sync to break. Dedicated vector databases like Pinecone make sense when you have millions of documents or need sub-millisecond query times. For most content sites, pgvector is more than fast enough -- typical query times are 5-20ms for collections under 1M vectors.
How do I keep AI embeddings in sync with content changes?
Webhooks are your friend. Every modern CMS supports them. When content is created or updated, fire a webhook that triggers re-embedding. Store a content hash alongside the embedding so you can skip unchanged content. For WordPress, use the save_post action. For Payload, use afterChange hooks. For Supabase, use database triggers or Realtime subscriptions.
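The content-hash check is simple enough to sketch in a few lines. The hash below is a small FNV-1a-style function that is good enough for change detection (it is not cryptographic; swap in SHA-256 if you prefer). `selectForReembedding` and the `SourceItem` shape are illustrative names of my own.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units; for change detection only.
function hashContent(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  // Unsigned, zero-padded to 8 hex characters
  return ('0000000' + (hash >>> 0).toString(16)).slice(-8);
}

interface SourceItem {
  id: string;
  text: string;
}

// Returns only the items whose text differs from what was last embedded.
function selectForReembedding(
  items: SourceItem[],
  storedHashes: Map<string, string>,
): SourceItem[] {
  return items.filter((item) => storedHashes.get(item.id) !== hashContent(item.text));
}
```

New items have no stored hash, so they fall through to embedding automatically. After a successful embed, write the new hash back alongside the vector.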
What about content in multiple languages -- does this approach still work?
Yes, but choose your embedding model carefully. OpenAI's text-embedding-3 models handle multilingual content well. Cohere's embed-v4 is specifically optimized for cross-lingual retrieval. The normalization layer should store the language code as metadata so your AI consumers can filter appropriately. One important note: embed each language version separately rather than concatenating translations.
Is migrating to a headless CMS a prerequisite for AI-ready content?
Not a prerequisite, but it helps enormously. Headless CMS architecture naturally separates content from presentation, which is the foundation of AI readiness. If you're still running a monolithic WordPress theme with content baked into template files, going headless (WordPress as a backend with a Next.js or Astro frontend) simultaneously improves your AI readiness and your frontend performance. It's often worth the investment even before considering AI use cases. If you want to explore this, reach out to us -- it's literally what we do every day.